Abstract


A  parser  model  is  presented  whose  structure  is  a  generalization  of 
that  of  the  well  known  LR(k)  parsers.  In  particular,  various  classes 
of parsers that would be both practical and efficient to use in a compiler
are  examined.  Associated  with  these  classes  of  parsers  is  a  hierarchy  of 
type-0  grammars,  each  grammatical  class  being  defined  in  terms  of  the 
form  and  structure  of  derivations. 

Deterministic  regular  parsable  (DRP)  grammars  are  of  special 
interest.  Parsers  based  on  these  grammars  will  detect  any  errors  as  soon 
as  possible  during  a  left  to  right  scan  of  the  input.  An  LR(k)  grammar 
is  also  DRP.  A  practical  parser  generator,  applicable  to  some  DRP  grammars, 
is  developed. 

Closure  and  decidability  results  of  this  new  hierarchy  are  examined. 
Of  particular  interest  is  the  fact  that  there  is  no  parser  generator 
capable  of  constructing  a  parser  that  will  detect  errors  as  soon  as 
possible  in  a  parse,  for  all  DRP  grammars.  Such  an  algorithm  exists 
for  LR(k)  grammars. 

Much  of  the  research  related  to  LR(k)  parsing  is  applicable 
to  the  new  parsers  we  discuss  in  this  thesis.  For  example,  syntax- 
directed  transduction  schemes  and  parser  size  reduction  techniques  can 
be  extended.  Moreover,  the  use  of  our  generalized  parsers  allows  us  to 
eliminate  look  ahead  transitions  in  all  these  parsers  (including  LR(k) 
parsers). This transformation will enhance the success of certain
table  reduction  methods. 
INTRODUCTION


A  compiler  is  a  program  which  is  executed  by  a 
digital computer and which translates a program (the source
program), written in some known input language, into an
equivalent  program  (the  object  program)  suitable  for 
execution  in  some  machine  environment.  One  of  the  important 
components  of  any  compiler  is  a  parser.  It  is  responsible 
for  the  conversion  of  a  string  of  source  tokens,  which  are 
almost  always  in  one  to  one  correspondence  with  elements 
of  the  source  program,  into  a  sequence  of  phrases  suitable 
for  use  in  the  synthesis  of  the  object  program. 

The  parser  is  concerned  only  with  the  syntactic 
structure  of  the  program.  Informally,  starting  with  the 
input  token  string,  the  parser,  step  by  step,  rewrites  the 
string,  and  if  the  program  is  error  free  it  terminates  when 
some  particular  goal  string  has  been  reached.  Each  such 
rewriting  step  may  be  associated  with  a  phrase  used  in 
constructing  the  object  program.  The  rewriting  of  the  token 
string  is  accomplished  with  the  assistance  of  a  set  of 
rewriting  rules  which  embody  the  syntax  of  all  error  free 
programs.

The Chomsky hierarchy of languages (Hopcroft and
Ullman 1969) is often studied in connection with the problems
of parsing and compilation. This classification is identified
with the concept of a formal grammar as a finite description
of languages. In these formal grammars, the rewriting rules
are usually called productions. There is an extensive theory
concerning the properties of the various classes of
Chomsky grammars, languages, and devices which recognize
them. This thesis adds new insight into these languages and
practical parsing devices for them.

It is important that a parser perform well in accepting
and rejecting a source program. For example, if a program
contains errors, the parser must reject it and produce
sufficient information so that the programmer can identify
and correct the problem. It is advantageous if the parser
can be restarted after detecting an error so that a search
for additional errors can be made. The detection of an error
as early as possible in the parse is also desirable. There
are several reasons for this. Such a parser will not apply
rewriting rules to that part of the input that contains the
error. In general the parser will be able to indicate to
the programmer the exact location of the error in the input.
Furthermore, if the parser is restarted, there is less chance
that additional errors will have been disguised or caused by the
application of rewriting rules around the original error.

A  parser  must  also  be  efficient  in  its  use  of  time  and 
space.  These  aspects  of  practical  parsing  (accepting  and 
rejecting,  efficiency,  and  error  handling)  distinguish  it 
from  the  automata  theory  problem  of  language  recognition. 
Background


General methods for constructing parsers that meet
these requirements have appeared in the literature for over
ten years now. Almost all of these methods are based on
the use of a Chomsky formal grammar to describe the source
language. An excellent, although slightly out of date,
survey of parser construction appears in (Feldman and Gries
1968). A more recent review of this topic is (Aho and
Ullman 1972a).

One of the major advantages of the formal grammar
approach has been the development of algorithms that will
automatically construct a parser for a programming language
from its description. These algorithms, known as parser
generators, are extremely useful tools for the compiler
writer. The parsers they construct are not susceptible to
programming mistakes and the parser generators themselves
make it easy to implement changes to the language's syntax.
Formalizing the process of building parsers has also led to
a far more organized approach to the entire task of compiler
writing. For example, formal grammar based parsing has been
used to formalize the process of code synthesis (Lewis and
Stearns 1968, Aho and Ullman 1969). The SLAP translator
writing system (Gorrie 1971) is an example of the practical
use of these techniques.




An Historical Review


The history of programming language parsing is
extensive; we shall only touch on some of the highlights
that are relevant to this thesis. In particular, we shall
concentrate on so-called bottom up parsers based on
Chomsky formal grammars. All the parsing methods described
below have implicitly used the same parser model. In very
general terms, a parser
is composed of an L-STACK (a stack), an R-STACK (really
an output restricted deque (Knuth 1968)), and a control.
Information about the form of the productions that describe
the source language is contained in the control. Initially
the input resides in the R-STACK, beginning at the left end.
The control is designed so that it either shifts a symbol
from the left end of the R-STACK and pushes it into the
L-STACK, or alternatively it applies a rewriting rule to
the top of the L-STACK. This latter operation, also known
as a reduction, entails removal of symbols from the top of
the L-STACK and insertion of some related symbols (related
by the productions) into the left end of the R-STACK. After
the application of a rewriting rule, the control resumes
the transfer of symbols from the R-STACK to the L-STACK.

If  a  distinguished  goal  string  appears  in  the  R-STACK  with 
the  L-STACK  empty,  then  the  input  is  accepted.  The  control  at 
some  point  may  decide  that  an  error  has  been  found  and  reject 
the  input.  We  shall  confine  our  interest  to  the  situation 
where  the  decision  by  the  control  to  shift  a  symbol  from  the 
R-STACK  to  the  L-STACK,  apply  a  rewriting  rule,  accept,  or 
reject  is  always  made  deterministically  without  reversal 
of  previous  decisions  (backtracking).  Otherwise,  there  is 
the  possibility  of  a  large  space  or  time  penalty,  which  we 
would  like  to  avoid. 
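
As an informal illustration of the model just described, the following Python sketch simulates an L-STACK, an R-STACK and a deterministic control. The toy grammar (S -> a S b | a b), the rule table and all function names are assumptions made for this example only. The control shown is deliberately naive: it makes its decisions without backtracking, but it makes no attempt to detect errors early.

```python
from collections import deque

GOAL = "S"

# Hypothetical toy grammar: S -> a S b | a b.
# Each entry pairs the string popped from the L-STACK with the
# string inserted at the left end of the R-STACK by a reduction.
RULES = [
    (("a", "S", "b"), ("S",)),
    (("a", "b"), ("S",)),
]


def control(l_stack, r_stack):
    """Toy deterministic control: choose the next action, never backtrack."""
    if not l_stack and list(r_stack) == [GOAL]:
        return ("accept",)
    for popped, pushed in RULES:
        if tuple(l_stack[-len(popped):]) == popped:
            return ("reduce", popped, pushed)
    if r_stack:
        return ("shift",)
    return ("reject",)


def parse(tokens):
    """Run the two stack model on a sequence of tokens; True means accept."""
    l_stack = []               # ordinary stack, top at the right
    r_stack = deque(tokens)    # input initially resides here, left end first
    while True:
        action = control(l_stack, r_stack)
        if action[0] == "accept":
            return True
        if action[0] == "reject":
            return False
        if action[0] == "shift":
            l_stack.append(r_stack.popleft())
        else:
            _, popped, pushed = action
            del l_stack[-len(popped):]
            # extendleft reverses its argument, so reverse first to keep order
            r_stack.extendleft(reversed(pushed))
```

With these assumed rules, parse("aabb") accepts, while parse("abb") is rejected only after the R-STACK has been exhausted, which shows why a more carefully constructed control is needed for early error detection.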

Shift  reduce  parsers  (Aho  and  Ullman  1972a)  operate 
exactly  as  our  model  does.  However,  since  they  are  based  on 
context  free  grammars,  they  may  only  push  a  single  symbol 
into the R-STACK during a reduction operation. Consequently,
our  parsers  are  a  generalised  and  more  powerful  version  of 
shift  reduce  parsers.  Griffiths  and  Patrick  (Griffiths  and 
Patrick  1965)  have  also  used  a  two  stack  parser  to  compare 
the  efficiency  of  various  context  free  parsing  algorithms. 
Gilbert's  analytic  grammars  (Gilbert  1966)  also  form 
a  related  concept.  Informally,  an  analytic  grammar  is 
a Chomsky grammar with the rewriting rules reversed to reflect
the  fact  that  we  are  parsing,  not  generating,  strings.  These 
rules  are  applied  repeatedly  to  the  input  until  no  rule  may 
be applied or until a specified goal string is formed.
Analytic grammars contain a device, called a scan, which
reads  through  the  input  and  decides  which  rule  to  use  and 
where  to  use  it.  In  its  most  general  form,  the  analytic 
grammar  is  an  extremely  powerful  device. 

Since  our  parser  model  is  for  descriptive  purposes 
only, we shall not give it a more formal definition.

The  weakness  of  all  the  parsers  we  will  discuss  is 
that,  apart  from  detecting  errors,  they  have  no  capacity 
for  issuing  error  messages  or  attempting  to  recover  from 
errors.  In  practice,  these  functions  are  carried  out  by 
some error control which may be considered to be an appendage
to the main control of the parser. General error recovery
techniques based on the actual form of the parser have been
considered in (Leinius 1970), (James 1972) and (Wynn 1973).

Various  parsing  techniques  and  parser  generators 
differ only in how the control operates and how it is
constructed. Research into parsing based on formal grammars
has been directed at increasing the power of the control and,
consequently,  enlarging  the  class  of  languages  that  can  be 
parsed.  It  is  from  this  point  of  view  that  we  shall  review 
some  well  known  parsing  techniques. 

Early  parsing  methods  with  a  formal  theoretical 
aspect to them included operator precedence parsing (Floyd
1963), precedence parsing (Wirth and Weber 1966), extended
precedence parsing (Wirth and Weber 1966, McKeeman 1966),
weak precedence parsing (Ichbiah and Morse 1970), and
transition matrices (Gries 1968). All of these methods
apply  only  to  a  subset  of  the  Chomsky  hierarchy  of  grammars 
(context  free  grammars),  and  all  but  the  latter  can  be 
succinctly  described  in  terms  of  Floyd's  bounded  context 
grammars  (Floyd  1964). 

A  context  free  grammar  is  an  (m,n)  bounded  context 
grammar  if  and  only  if  the  control  is  able  to  determine  that 
a  string,  u,  at  the  top  of  the  L-STACK  is  suitable  for  the 
application  of  a  rewriting  rule  by  examining  only  the  m 
symbols  below  u  in  the  L-STACK  and  the  first  n  symbols  in 
the  R-STACK.  The  precedence  grammars  in  (Wirth  and  Weber  1966) 
are  all  (1,1)  bounded  context  grammars.  The  class  of 
extended  precedence  grammars  properly  contains  precedence 
and  operator  precedence  grammars.  Extended  precedence 
and  weak  precedence  grammars  are  (m,n)  bounded  context 
grammars.  In  general,  transition  matrices  are  an  extremely 
powerful  technique;  however,  their  particular  application 
to operator grammars in (Gries 1968) is also included in the
bounded  context  concept. 
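
To make the (m,n) window concrete, the sketch below extracts exactly the symbols that a bounded context control would be permitted to examine when deciding whether a string u at the top of the L-STACK may be rewritten. The function name, its parameters and the list representation of the stacks are assumptions for illustration; no particular parsing method from the literature is being reproduced.

```python
def bounded_context(l_stack, r_stack, handle_len, m, n):
    """Return the (left, right) context visible to an (m,n) bounded context control.

    l_stack    -- list of symbols, top of the L-STACK at the right
    r_stack    -- list of symbols, left end of the R-STACK first
    handle_len -- length of the candidate string u on top of the L-STACK
    """
    below = l_stack[:len(l_stack) - handle_len]   # symbols underneath u
    left = below[max(0, len(below) - m):]         # at most m symbols below u
    right = list(r_stack)[:n]                     # at most the first n symbols
    return left, right
```

For example, with L-STACK x y z a b, candidate u = a b, and R-STACK c d e, a (1,1) control sees only z on the left and c on the right.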

More  recently,  extensions  to  precedence  parsing  have 
been made by Colmerauer (Colmerauer 1970) and by Gray
and Harrison (Gray and Harrison 1973). These schemes can
also  be  described  in  terms  of  (1,1)  bounded  context  grammars. 
An important theoretical development in parsing was
to extend the control so that it could use the entire contents
of  the  L-STACK  as  data  upon  which  to  determine  whether  to 
shift,  apply  a  rewriting  rule,  accept,  or  reject.  This  was 
done  in  Knuth's  LR(k)  paper  (Knuth  1965).  An  LR(k)  control 
is  composed  of  states  which  represent  the  information 
contained  in  the  L-STACK.  The  control  makes  its  decision 
based  on  its  state  and  at  most  the  first  k  symbols  in  the 
R-STACK,  and  so,  it  may  be  thought  of  as  having  unlimited 
left  context  in  terms  of  (m,n)  bounded  context  grammars. 

LR(k)  grammars  correspond  to  the  largest  set  of  context 
free  languages  that  can  be  deterministically  parsed  from 
left  to  right,  looking  ahead  at  only  a  bounded  number  of 
tokens.  LR(k)  grammars  also  correspond  to  the  class  of 
context  free  languages  recognizable  by  a  deterministic  pushdown 
automaton  (DPDA). 


Knuth  describes  an  algorithm  to  construct  an  LR(k) 
parser  from  an  LR(k)  grammar.  In  actual  use,  this  parser 
generator  is  often  impractical.  De  Remer  (De  Remer  1969) 
and  Korenjak  (Korenjak  1969)  have  described  algorithms  that 
are  more  practical  and  which  construct  parsers  that  are 
more  space  efficient.  Aho  and  Ullman  (Aho  and  Ullman  1972b), 
Joliat  (Joliat  1973),  and  Anderson,  Eve,  and  Horning 
(Anderson  et  al  1973)  propose  algorithms  that  can  reduce  the 
size  of  these  parsers  even  further. 
Apart from the fact that LR(k) parsing methods can
handle  a  broader  class  of  grammars  than  bounded  context 
grammars,  LR(k)  parsers  also  have  very  good  error  detecting 
properties.  At  any  time  during  a  parse  using  LR(k)  methods, 
it  is  always  possible  that  the  unread  portion  of  the  original 
input  may  cause  the  parser  eventually  to  accept  the  input. 
Consequently,  any  error  is  found  as  soon  as  possible  in  the 
left  to  right  scan  of  the  input.  As  a  result, 

(a)  error  messages  are  more  likely  to  appear  at  a 
meaningful  location  in  the  source  program; 

(b)  error  recovery  will  be  easier  since  the  parser  has 
not  scanned  past  the  actual  error  and  need  not  undo 
any  reductions  it  may  have  made. 

For  compiler  writers  and  users,  these  features  of  LR(k) 
parsing  are  a  great  advantage.  Since  LR(k)  parsers  have 
proven  to  be  as  fast  as  many  bounded  context  techniques 
(Lalonde  et  al  1971,  Anderson  et  al  1973),  they  represent 
a  very  advantageous  parsing  method. 

LR(k)  grammars  are  as  far  as  we  can  go  in  the  area 
of  context  free  grammars,  with  our  parser  model.  These 
grammars  have  provided  a  good  model  for  current  programming 
languages, although not without some inadequacies (Floyd 1962).
On the other hand, it is unlikely that the full power of
context sensitive grammars, the next step up in the Chomsky
hierarchy, is required. As a result, there has been
considerable interest in defining different kinds of
grammars, more powerful than the class of context free
grammars, and perhaps more suitable as a model for programming
languages.

Programmed grammars (Rosenkrantz 1969), scattered
context grammars (Greibach and Hopcroft 1969), Van Wijngaarden
grammars (Van Wijngaarden 1969), and indexed grammars
(Aho 1968) are examples of efforts in this area. In all
cases, the emphasis has been very much on the ability of
these grammars to generate a class of languages that includes
and is larger than the set of context free languages. Little
or no attention has been paid to the problem of parsing
according to these new grammars.

Walters' research (Walters 1970) takes the opposite
approach and, with context sensitive grammars as a starting
point, describes a parser which operates in a very similar
fashion to Knuth's LR(k) parser (the procedure for constructing
these parsers is also analogous to Knuth's parser generator).
Although all languages recognizable by a deterministic
linear bounded automaton have a grammar for which Walters'
procedure will construct a parser, this procedure is not
an algorithm (that is, his procedure is not guaranteed to
halt). Thus his research is also of more theoretical than
practical interest.
Haskell (Haskell 1972) has extended precedence
parsing techniques to context sensitive grammars. Although
he has developed an algorithm to construct a precedence parser
for a particular grammar, if it exists, it is possible that
the parse of an erroneous input may never terminate. In
practice, some means of avoiding this problem would be
required.

Revesz  (Revesz  1971)  has  also  studied  the  problem  of 
parsing  context  sensitive  languages  in  a  left  to  right  manner. 
His  techniques  require  that  the  rewriting  rules  of  the  grammar 
be  in  very  special  forms.  Such  restrictions  limit  the 
practical  utility  of  any  parsing  scheme. 

There  have  been  other  context  sensitive  parsing 
methods in the literature (for example, Woods 1970, Kuno 1967,
Rosen 1967). These papers lie in the field of computational
linguistics.  Their  goal  is  to  construct  a  parser  for  any 
context  sensitive  grammar  and  so  these  parsers  include 
backtracking  capabilities  and  are,  in  general,  unsuitable 
for  programming  language  compilation. 

In  this  thesis,  we  describe  and  investigate  classes 
of  grammars,  defined  so  that  their  parsers  satisfy  the 
requirements  of  the  compilation  process.  Apart  from  the 
intrinsic interest of these sets of grammars and languages,
they represent a potentially useful model for programming
languages.

The Problem


In  the  Chomsky  hierarchy  of  grammars,  the  subsets  of 
type-0  grammars  (context  sensitive  grammars,  context  free 
grammars,  etc.)  are  defined  in  terms  of  the  characteristics 
of  the  rewriting  rules  (for  example,  rules  in  a  context  free 
grammar must have only a single symbol on the left side).

On  the  other  hand,  research  into  parsing  has  defined  classes 
of  grammars  primarily  in  terms  of  the  forms  that  derivations 
may  take.  This  is  certainly  true  of  Floyd’s  bounded  context 
grammars  and  of  Knuth’s  more  general  LR(k)  grammars. 

The  reasons  for  this  approach  are  quite  simple. 

Parsing  is  concerned  with  the  syntactic  structure  of  source 
programs  and  so  any  parsing  method,  based  on  a  formal  grammar, 
is  intimately  connected  with  how  strings  are  derived.  Thus 
researchers  have  restricted  their  attention  to  grammars  whose 
derivations  are  amenable  to  efficient  parsing. 

This  thesis  presents  a  new  subdivision  of  type-0 
grammars,  a  classification  defined  in  terms  of  the  forms  of 
derivations  and  the  needs  of  compilation.  Informally,  these 
include  the  requirement  that  a  parser  be  both  time  and  space 
efficient and that it detect errors very early in a parse.

The  obvious  advantage  of  this  kind  of  classification  is  that 
we  may  study,  in  their  most  general  context,  grammars  that 
are  well  suited  to  the  problems  of  parsing  and  compilation. 

Once  we  begin  to  consider  grammars  that  are  not  context 
free,  we  gain  the  ability  to  structure  programming  languages 
(whether  they  are  context  free  or  not)  in  a  way  that  no 
context  free  grammar  could.  Book,  for  instance,  speaks  of  the 
"message  sending"  capability  of  context  sensitive  grammars 
(in  Aho  1973).  Nonterminal  symbols  can  be  used  during  a 
derivation  to  store  and  transmit  information  along  a  sentential 
form.


The  potential  use  of  new  structuring  capabilities  is 
quite  broad.  For  example,  the  syntax  of  a  block  structured 
language may allow the text for a procedure, A, to occur
anywhere  in  the  midst  of  the  statements  of  the  containing 
procedure,  B.  The  writer  of  a  one  pass  compiler  would  prefer 
that  all  contained  procedures  should  be  placed  together  at  the 
beginning  of  procedure  B  so  that  special  jump  instructions 
need  not  be  emitted  to  branch  around  the  code  emitted  for  the 
various  inner  blocks  such  as  A.  The  required  restructuring  is 
easily  accomplished  automatically  by  a  parser  based  on  a 
context  sensitive  grammar. 
De Remer has also discussed restructuring (De Remer 1973)
and has suggested the following uses:

(i) language constructs that are semantically
equivalent but syntactically different may be mapped
into a single form;

(ii) constructs that are inconveniently structured for
the purposes of code generation may be transformed;

(iii) redundant or useless information may be deleted;

(iv) distributed information may be collected or sorted
(as we did in our example above);

(v) restructuring may result in more efficient object
programs.

In general then, we can consider the syntactic description of
a programming language in two parts. A "kernel" grammar
could be used to derive standard versions of source programs.
The structure of these standard forms would facilitate the
code generation process. The second part of the syntactic
description would contain transformational productions that
are not context free. These rules can be applied to obtain
the final version of the source program. The allowable
forms of the final source program would be geared to the
process of programming and would take into account factors such
as

(i) different programming styles; and
(ii) constructs that are structured to prevent programming
errors (by providing redundant information, etc.).
Thus the investigation of our new classes of grammars
provides  a  foundation  for  the  implementation  of  some  of  these 
new  ideas.  For  example,  if  a  parser  generator  can  be  designed 
for  some  of  these  grammars,  the  new  restructuring  capabilities 
could  easily  be  put  into  practical  use.  Furthermore,  De  Remer 
(De  Remer  1973)  has  pointed  out  that  the  use  of  formalisms 
in  this  way  can  increase  our  understanding  of  programming 
language  design  and  compiler  structure.  As  a  side  effect, 
general  results  may  often  give  new  insights  into  special 
subclasses  such  as  LR(k)  grammars.  This  has  been  the  case  in 
this study.

It is also possible that a new "look" at the range of
type-0  grammars  may  lead  to  a  better  understanding  of  the 
Chomsky  hierarchy  and  some  of  the  problems  of  formal  language 
theory.  This  line  of  study  has  not  been  pursued  in  this 
thesis  since  our  main  interest  is  in  the  design  and  construction 
of  practical  parsers. 

The  success  of  LR(k)  parsing  has  led  us  to  seek  the 
general  principles  that  make  this  method  so  attractive.  These 
characteristics  can  be  applied  to  all  type-0  grammars  to 
define  a  new  hierarchy.  Sets  of  grammars  are  defined  which 
possess  some  or  all  of  these  characteristics. 
Each type of grammar is considered in this thesis,
but particular attention is paid to a class we call
deterministic regular parsable grammars. A parser that meets
all the requirements of efficiency and early error detection
can be constructed from any of these grammars. At the same
time, this classification, which includes all LR(k)
grammars, seems to be quite a broad one.

After  reviewing  some  notation  and  background  material 
in  Chapter  2,  we  discuss  the  characteristics  that  make  a 
parser  attractive  from  the  point  of  view  of  compilation. 

A  parser  model,  the  two  stack  machine,  is  defined  and  its 
behaviour  and  relation  to  type-0  grammars  is  examined. 

The  remainder  of  the  thesis  investigates  restricted  forms 
of  the  two  stack  machine  and  equivalent  grammatical  classes. 

Not  all  deterministic  two  stack  machines  will  halt  on 
all  inputs.  This  characteristic,  undesirable  in  a  practical 
parser,  is  considered  in  Chapter  4. 

Chapter  5  introduces  the  concepts  of  regular  parsable 
grammars. It is shown that parsers for deterministic
regular  parsable  grammars  meet  all  the  requirements  we  set 
out  in  Chapter  3.  The  connection  between  regular  parsable 
grammars  and  various  forms  of  the  two  stack  machine  is 
illustrated.

A  parser  generator  applicable  to  some  regular  parsable 
grammars is described and investigated in Chapter 6.

The  closure  and  decidability  properties  of  these  new 
classes  of  grammar  are  the  subject  of  Chapter  7.  Among 
the  results,  we  show  that  there  is  no  general  parser  generator 
algorithm  applicable  to  all  deterministic  regular 
parsable  grammars.  Such  an  algorithm  does  exist  for  LR(k) 
grammars.

In  Chapter  8,  we  study  the  applicability  of  research 
related to LR(k) parsing. Of particular importance is
the  applicability  of  size  reduction  techniques  to  our  parsers. 
It  is  also  shown  that  the  results  of  these  techniques  can 
be  improved  by  the  use  of  non-context  free  productions  in 
grammars.  This  discovery  was  a  direct  consequence  of  our 
general  approach  to  the  definition  of  grammatical  classes. 

The  thesis  ends  with  our  conclusions  and  a  discussion 
of  some  open  questions  that  have  been  raised  during  the 
course  of  this  investigation. 
CHAPTER 2

DEFINITIONS AND NOTATION

The  notation  and  terminology  used  in  research  in 
language  theory  and  parsing  can  be  quite  varied.  In  this 
chapter,  we  will  define  the  notation  that  will  be  used 
consistently  throughout  the  rest  of  this  thesis.  We  shall 
also  review  the  background  material  relevant  to  this 
research.  A  more  detailed  discussion  of  formal  languages 
can be found in (Hopcroft and Ullman 1969).


Strings  and  Type-0  Grammars 

Let $\phi$ denote the empty set and $e$ denote the empty
string. The concatenation of two strings x and y is written
xy. If $x^0 = e$, then the string $x^i$ is defined recursively
by

$$x^i = x^{i-1} x, \quad \text{for } i \ge 1.$$

If $x = a_1 a_2 \ldots a_n$, then $x^R = a_n \ldots a_2 a_1$. If $x = e$, then $x^R = e$.

Similarly, for two sets of strings, X and Y,
$XY = \{xy \mid x \in X \text{ and } y \in Y\}$. $X^0 = \{e\}$ and the set $X^i$ is
defined recursively by

$$X^i = X^{i-1} X, \quad \text{for } i \ge 1.$$

$X^*$ is defined by

$$X^* = \bigcup_{i=0}^{\infty} X^i$$

and $X^+ = X^* - \{e\}$. If X contains the single element,
x, then $X^i$, $X^*$ and $X^+$ are often written $x^i$, $x^*$, and $x^+$
respectively.
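
The string and set operations above can be sketched directly in Python, with strings as `str` values and sets of strings as `set` values. The function names are our own assumptions. Since X* is infinite whenever X contains a nonempty string, the sketch only computes the finite union of the powers of X up to a chosen bound.

```python
def power(x, i):
    """x^i: the string x concatenated with itself i times (x^0 is empty)."""
    return x * i


def reverse(x):
    """Reversal of x: a_n ... a_2 a_1 for x = a_1 a_2 ... a_n."""
    return x[::-1]


def concat(X, Y):
    """XY = {xy | x in X and y in Y}."""
    return {x + y for x in X for y in Y}


def set_power(X, i):
    """X^i, built from X^0 = {empty string} by repeated concatenation."""
    result = {""}
    for _ in range(i):
        result = concat(result, X)
    return result


def star_up_to(X, n):
    """Finite approximation of X*: the union of X^0, X^1, ..., X^n."""
    result = set()
    for i in range(n + 1):
        result |= set_power(X, i)
    return result
```

For instance, set_power({"a", "b"}, 2) yields the four strings aa, ab, ba and bb.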


The operations of right quotient (/) and left
quotient (\) are related to concatenation. Let X and Y
be any two sets of strings. Then

$$X/Y = \{u \mid uy \in X \text{ for some } y \in Y\}$$

and

$$X \backslash Y = \{v \mid xv \in Y \text{ for some } x \in X\}.$$
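
For finite sets of strings the two quotients can be computed by stripping matching suffixes or prefixes. The following sketch assumes Python `str` and `set` values; the function names are ours.

```python
def right_quotient(X, Y):
    """X/Y = {u | uy in X for some y in Y}: strip a suffix y from members of X."""
    return {w[:len(w) - len(y)] for w in X for y in Y if w.endswith(y)}


def left_quotient(X, Y):
    """X\\Y = {v | xv in Y for some x in X}: strip a prefix x from members of Y."""
    return {w[len(x):] for w in Y for x in X if w.startswith(x)}
```

Note the asymmetry in the roles of the arguments: the right quotient removes members of Y from the ends of strings in X, whereas the left quotient removes members of X from the beginnings of strings in Y.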

A type-0 grammar (TOG) is a quadruple G = (N,T,P,S).
N, T and P are all finite sets. N is the set of nonterminals
and is disjoint from T, the set of terminal
symbols. P is a set of productions of the form $x \to y$
where $x \in (N \cup T)^+$ and contains at least one nonterminal
symbol and $y \in (N \cup T)^*$. S is a distinguished member of N
known as the goal symbol.

Without loss of generality, we will number the
productions in P in some arbitrary order. If $x \to y$ is
the $p^{th}$ production, we may also write it $x \to y A_p$. The
symbol, $A_p$, is called an apply symbol and is not formally
part of the grammar but serves to identify the production.

To simplify our discussion, we shall call $V = N \cup T$
the vocabulary of the grammar and make the following
notational conventions:

(i) A, B, and C denote elements of N;

(ii)  a,  b,  and  c  are  elements  of  T; 

(iii)  capital  letters  near  the  end  of  the  alphabet  denote 
either  terminals  or  nonterminals; 

(iv)  small  letters  near  the  end  of  the  alphabet  denote 
arbitrary  strings  over  V. 
We also define the following useful string functions
for any string $w \in V^*$ ($|w|$ denotes the length of w):

(i) $\mathrm{FIRST}(w,n) = \begin{cases} w & \text{for } |w| \le n \\ x & \text{for } |w| > n, \text{ where } |x| = n \text{ and } xy = w \end{cases}$
- defined for $n \ge 0$ only.

(ii) $\mathrm{LAST}(w,n) = \begin{cases} w & \text{for } |w| \le n \\ x & \text{for } |w| > n, \text{ where } |x| = n \text{ and } yx = w \end{cases}$
- defined for $n \ge 0$ only.

(iii) $\mathrm{PREFIX}(w) = \{\mathrm{FIRST}(w,n) \mid 1 \le n \le |w|\}$.

The  definition  of  FIRST,  LAST  and  PREFIX  can  be 
extended  to  apply  to  sets  of  strings  so  that  for  example, 
for some set S, $\mathrm{PREFIX}(S) = \{x \mid x \in \mathrm{PREFIX}(w) \text{ for some } w \in S\}$.
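
A minimal Python sketch of FIRST and LAST, together with the extension of FIRST to sets of strings, is given below. The lower-case function names are assumptions, and treating n = 0 as permitted (yielding the empty string) is our reading of the definitions rather than something the sketch can confirm.

```python
def first(w, n):
    """FIRST(w, n): w itself if |w| <= n, otherwise the first n symbols of w."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return w if len(w) <= n else w[:n]


def last(w, n):
    """LAST(w, n): w itself if |w| <= n, otherwise the last n symbols of w."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # w[len(w) - n:] is used instead of w[-n:], which would return all of w for n = 0
    return w if len(w) <= n else w[len(w) - n:]


def first_of_set(S, n):
    """Extension of FIRST to a set of strings, applied member by member."""
    return {first(w, n) for w in S}
```

LAST and PREFIX extend to sets in the same member-by-member fashion.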

For any binary relation R on a set S, the transitive
closure of R, $R^+$, is defined recursively by

(i) if $aRb$, then $aR^+b$;

(ii) if $aR^+b$ and $bRc$, then $aR^+c$.

The reflexive and transitive closure of R, $R^*$, is defined
by

(i) $aR^*a$ for all a in S;

(ii) if $aR^+b$, then $aR^*b$.

The k-fold closure of R, $R^k$, is defined by

(i) $aR^0a$ for all a in S;

(ii) if $aR^{k-1}b$ and $bRc$, then $aR^kc$.
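
For a finite relation these closures can be computed by iterating relational composition until nothing new is added. The sketch below represents a relation as a Python set of pairs; the function names are assumptions chosen for this illustration.

```python
def compose(r1, r2):
    """Relational composition: (a, c) whenever a r1 b and b r2 c for some b."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def transitive_closure(r):
    """Smallest relation containing r and closed under following one more r step."""
    closure = set(r)
    while True:
        extended = closure | compose(closure, r)
        if extended == closure:
            return closure
        closure = extended


def reflexive_transitive_closure(r, s):
    """Transitive closure of r together with (a, a) for every a in the set s."""
    return transitive_closure(r) | {(a, a) for a in s}


def k_fold_closure(r, s, k):
    """Pairs related by exactly k steps of r; k = 0 gives the identity on s."""
    result = {(a, a) for a in s}
    for _ in range(k):
        result = compose(result, r)
    return result
```

With R = {(1,2), (2,3)} on S = {1,2,3}, the transitive closure adds the pair (1,3), and the 2-fold closure consists of (1,3) alone.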

Consider a TOG G = (N,T,P,S). A string w is said to
derive in one step a string w' if and only if w = xyz and
w' = xuz and $y \to u \in P$. This relation between w and w' is
written

$$w \Rightarrow_G w'$$

or simply $w \Rightarrow w'$ if G is understood. Thus the transitive
closure of $\Rightarrow_G$ is $\Rightarrow_G^{+}$ (or $\Rightarrow^{+}$ if G is understood). Similarly
the transitive and reflexive closure of $\Rightarrow_G$ is
$\Rightarrow_G^{*}$ (or $\Rightarrow^{*}$)
and the k-fold closure of $\Rightarrow_G$ is
$\Rightarrow_G^{k}$ (or $\Rightarrow^{k}$).

If $w \Rightarrow^{*} w'$, we say w derives w'. If the goal
symbol derives a string w, then w is a sentential form.


We often refer to w_1 =>* w_m as a derivation of w_m
from w_1. Since the notation w_1 =>* w_m or
w_1 => w_2 => ... => w_m does not tell us which productions
were used in each step, we can describe a derivation in more
detail by the sequence (p_1,r_1),...,(p_{m-1},r_{m-1}) where
the i-th step in the derivation uses the p_i-th production
(specified by its apply symbol). The left side of production
p_i begins at the r_i-th symbol of w_i. To say that two
derivations are the same or different means that the
sequences describing them are the same or different.


The language generated by a TOG, G, is L(G), and is
defined by

L(G) = {w | S =>* w and w ∈ T*}.

If a set, L, equals L(G) for some type-0 grammar, G, then
L is a type-0 language (TOL).

There  exist  well  known  subsets  of  the  classes  of 
type-0  grammars  and  languages.  These  subsets  form  the 
Chomsky hierarchy. If all productions in a TOG, G, are
of the form x -> y and 1 ≤ |x| ≤ |y|, then G is a context
sensitive grammar (CSG). A context sensitive grammar which
only has productions of the form A -> y is a context free
grammar (CFG). Finally, if the productions of a CFG, G,
are all of the form A -> Ba or A -> a, then G is a
left linear grammar.

Clearly  the  class  of  TOG's  includes  all  CSG's,  the 
class  of  CSG’s  includes  all  CFG's,  and  the  class  of  CFG’s 
includes  all  left  linear  grammars.  The  containment  is 
proper  in  each  case. 

By convention, the definitions of context sensitive,
context free, and left linear grammars are usually extended
so that a grammar, G = (N,T,P,S), may include a production
of the form S -> e if S does not appear on the right side
of any other production. The net effect of this extension
is that such grammars may now generate the empty string.

A language generated by a TOG, CSG, CFG, or left
linear grammar is a type-0 language (TOL), context sensitive
language (CSL), context free language (CFL), or regular
language (or regular set) respectively. It should be clear
from our definitions that any regular language is a CFL,
any CFL is a CSL, and any CSL is a TOL. It is well
known that these inclusions are proper (Hopcroft and Ullman
1969). For example, there are context sensitive languages
that are not context free languages.


We shall extend the definition of context free
grammar even further by allowing productions of the form
A -> e. In (Hopcroft and Ullman 1969), it is shown that
any language generated by a grammar of this type is also
generated by a CFG which satisfies our previous definition.


Canonical  Derivations  and  Ambiguity 

In discussing context free grammars (CFG's), the
concept of a rightmost derivation as the canonical derivation
is often used. This kind of derivation is often described
as one in which the rightmost nonterminal symbol is
expanded in each step of the derivation. We could rephrase
this definition in more general terms by describing a
rightmost derivation as one in which the nonterminal symbol
expanded in each step of the derivation does not lie to the
right of the nonterminal expanded in the previous step. For
context free grammars, when we are concerned with the
derivation of terminal strings, the distinction between
these definitions is of no importance. For example, let
xAy => xuy be a derivation step using the production A -> u.
If the derivation is to be rightmost, the next step must
expand a nonterminal in x or u but not in y. Now, if A was
not the rightmost nonterminal in xAy, then any nonterminals
in y will never be expanded. Thus, if we are interested
in the derivation of terminal strings, the definitions are
entirely equivalent. The distinction is important when
we define a canonical derivation for TOG's.

The general concept of rightmost derivation can be
carried over to apply to all TOG's. In this general case,
we will refer to such derivations as canonical derivations.
Consider  a  derivation 

Wf  =  x^y^z^  =  >  ~  ^2'^2^2  ~  **' 

=  >x  u  z  =  w  „  where  y.  ->  u  ,  . 
m  m  m  m  i  i 

This  derivation  is  canonical  if  and  only  if  x  ^  u^  j^PRE  F IX  ( x^  ^  ^  ) 

for  all  i,  l£i_<m-l  .  In  the  case  of  CFG’s  our  definition 

is  the  same  as  the  definition  of  rightmost  derivation.  As 

for  CFG’s,  it  can  easily  be  shown  that  if  w  derives  w  , 

1  m 

then  there  is  a  canonical  derivation  of  w  from  w, .  When 

m  1 

we  are  speaking  of  this  canonical  derivation,  we  write 

w  =^*  w 
1  G  m 

or  w  =>  *w  when  G  is  understood, 
i  m 

To emphasize the fact that a canonical derivation
does not require the use of the absolutely rightmost
nonterminal at each step, consider the following grammar

S -> S'
S' -> aS'BC
S' -> aBC
CB -> BC
bC -> bc
bB -> bb
aB -> ab
cC -> cc

and the following derivation:

S  =>  S’  =>  aS'BC  =>  aaBCBC  =>  aaBBCC 

=>  aabBCC  =>  aabbCC  =>  aabbcC  =>  aabbcc. 

Note  that  although  the  "rightmost”  nonterminal  is  not 
expanded  at  each  step,  the  derivation  is  canonical  and 
results  in  a  string  of  terminal  symbols. 
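That every step of the derivation above rewrites exactly one occurrence of a left side can be checked mechanically. The sketch below assumes single-character symbols, so the nonterminal S' is renamed Z (a hypothetical renaming used only here). It tests the one-step relation w => w' but does not test whether the derivation is canonical.

```python
# Productions as (left side, right side); "Z" stands for S'.
PRODUCTIONS = [
    ("S", "Z"),
    ("Z", "aZBC"),
    ("Z", "aBC"),
    ("CB", "BC"),
    ("bC", "bc"),
    ("bB", "bb"),
    ("aB", "ab"),
    ("cC", "cc"),
]

DERIVATION = ["S", "Z", "aZBC", "aaBCBC", "aaBBCC",
              "aabBCC", "aabbCC", "aabbcC", "aabbcc"]


def derives_in_one_step(w, w2, productions):
    """True iff w = xyz, w2 = xuz and y -> u is one of the productions."""
    for y, u in productions:
        start = w.find(y)
        while start != -1:
            if w[:start] + u + w[start + len(y):] == w2:
                return True
            start = w.find(y, start + 1)
    return False


def is_derivation(steps, productions):
    """True iff every adjacent pair of strings is related by =>."""
    return all(derives_in_one_step(a, b, productions)
               for a, b in zip(steps, steps[1:]))
```

Calling is_derivation(DERIVATION, PRODUCTIONS) confirms that the eight steps are each licensed by one production of the grammar.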

With  our  definition  of  canonical  derivation  we  can 
introduce  the  concept  of  ambiguity  in  any  type-0  grammar. 

For any TOG, G, an element of L(G) is said to be ambiguous
if it is derivable from S by two different canonical
derivations. G is ambiguous if L(G) contains an ambiguous
string. A TOL, L, is inherently ambiguous if and only if all
grammars, G, such that L = L(G), are ambiguous.

All  our  definitions  are  the  same  as  the  corresponding 
definitions  in  the  case  of  context  free  languages. 

Parsers 


So far we have looked at languages from the point of
view of deriving or generating strings. We are more interested
in the dual process, determining whether a string is in the
language and, if so, how it was derived. This process is
called parsing. A parse of w_m as w_1 is the enumeration in
some order of a sequence (p_1,r_1),...,(p_{m-1},r_{m-1})
describing the derivation w_1 =>* w_m. A parser for a grammar
G = (N,T,P,S) is an algorithm which, given an input string, w,
attempts to parse w as S. If the parse of w as S results in a
sequence (p_1,r_1),...,(p_{m-1},r_{m-1}) describing a canonical
derivation of w from S, then the parser is canonical. Canonical
parsers are of particular importance to compiler writers since
rewriting rules are applied in a predictable order and code
synthesis is often directly related to these rules. We
shall use the term parser for canonical parser in this
thesis.
TWO STACK MACHINES


In general terms, our goal is to define and
examine the class of languages that can be parsed
deterministically in a single left to right scan without
backtracking. Since backtracking and multiple passes are
inherently inefficient operations, this class is the largest
set of languages that are practical to use as programming
languages.

Clearly LR(k) languages, which have proven to be
good models of current programming languages, are included
in this broad definition. Hence, by examining this larger
class, we may study, in a general context, languages that
embody programming language features and that can be
parsed efficiently.

Because LR(k) parsing is so fundamental to our
discussion, we begin with a review of LR(k) parser construction
and LR(k) grammars.


LR(k) Parsers and LR(k) Grammars

For any context free grammar (CFG), G = (N,T,P,S),
we can construct a k-augmented grammar, G_A = (N_A,T_A,P_A,S_A),
where N_A = N ∪ {S_A} (S_A ∉ V), T_A = T ∪ {⊥} (⊥ ∉ V), and
S_A is the new goal symbol. P_A contains all the productions
in P as well as a new unique goal production, S_A -> S⊥^k.
The new terminal symbol, ⊥, is called a goal post and serves
to identify the end of any string generated by G_A. By
convention, S_A -> S⊥^k is given the apply symbol A0;
corresponding productions in P and P_A are given the same
apply symbols. G_A is a CFG and generates the language
L(G)⊥^k.
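The construction of the k-augmented grammar is mechanical. The following is a minimal sketch, assuming productions are stored as (left side, tuple of right side symbols) pairs whose list index plays the role of the apply symbol number; the names augment, new_goal and goal_post, and the glyph chosen for the goal post, are hypothetical.

```python
def augment(nonterminals, terminals, productions, goal, k,
            new_goal="S_A", goal_post="⊥"):
    """Return the k-augmented grammar (N_A, T_A, P_A, S_A).

    The new goal production S_A -> S followed by k goal posts is placed
    at index 0, so it receives apply symbol number 0; every original
    production keeps its relative order after it.
    """
    vocabulary = set(nonterminals) | set(terminals)
    if new_goal in vocabulary or goal_post in vocabulary:
        raise ValueError("new symbols must not already be in V")
    goal_production = (new_goal, (goal,) + (goal_post,) * k)
    return (set(nonterminals) | {new_goal},
            set(terminals) | {goal_post},
            [goal_production] + list(productions),
            new_goal)
```

With k = 0 the goal production degenerates to S_A -> S, which is the form used when an LR(0) parser is constructed.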


Let G be any CFG and let G_A be the k-augmented
grammar constructed from it. The set of characteristic
strings of G is

CS(G) = {wAp | S =>* x_1 B x_2 => x_1 y x_2, x_1 y = w,
         B -> yAp ∈ P}.

The set of k-augmented characteristic strings is

ACS(G,k) = {(wAp,z) | either S_A =>* x_1 B z x_2 => x_1 y z x_2,
            x_1 y = w, B -> yAp ∈ P_A, and |z| = k, or
            (wAp,z) = (S⊥^k A0,e)}.

The string, z, in (wAp,z) is called a look ahead string.
Note that the k goal posts in G_A guarantee that all look
ahead strings (except in (S⊥^k A0,e)) are k symbols long.


G  is  LRCk)  if  and  only  if  there  are  not  two  different 
elements  of  ACS(G,k),  say  (w^Ap^,z^)  and  ^  ,  z  y  such 

that  ^ PREFIX  (w^z^^)  .  Equivalent  definitions  appear 
in  (Knuth  1965,  Hopcroft  and  Ullman  1969,  De  Remer  1969). 

A language, L, is LR(k) if and only if there exists an
LR(k) grammar G such that L = L(G). In a more general
context, the term, LR(k) languages, refers to the set of
languages that are LR(k') for some value of k'.


For some integer, k ≥ 0, an LR(k) parser generator
is an algorithm that, for any given CFG, computes the set
of k-augmented characteristic strings and determines if the
LR(k) condition holds. An LR(k) parser for a grammar G
is composed of a finite state control and a stack. The
structure of the finite state control is based on the
set ACS(G,k).


LR(k)  Parser  Generator 


There  have  been  several  descriptions  of  Knuth’s 
algorithm  in  the  literature  (Knuth  1965,  Anderson  et  al 
1973,  Aho  and  Ullman  1972b).  Often  different  terminology 
is  used.  In  a  later  chapter  we  will  introduce  a  new 
parser  generator  algorithm,  for  TOG’s,  that  is  similar  in 
many  respects  to  Knuth’s  LR(k)  algorithm.  Our  notation  will 
parallel Knuth's (Knuth 1965) with some slight differences.

Let k be any non-negative integer and let
G_A = (N_A,T_A,P_A,S_A) be the k-augmented grammar constructed
from a CFG, G = (N,T,P,S). Informally, the algorithm we will
describe attempts to construct the states of the finite
state control of an LR(k) parser for G. In the process it
determines whether G is, in fact, LR(k).

A state q in the control is constructed by first
generating a state set Q(q), unique to q. Q(q) is made up of
items of the form (p,j,w) where p is the number of one of
the productions (specified by its apply symbol), j is the
location (indexed from zero) of one of the symbols on the right
side of production p, and w ∈ (N_A ∪ T_A)^k is a look ahead
string. Often it is convenient to represent the same item by

(A -> Y_0...Y_{j-1} ∧ Y_j Y_{j+1}...Y_{n_p} Ap, w),

where the marker ∧ precedes the j-th symbol in Y_0...Y_{n_p} Ap.
The existence of this item in Q(q) implies that, if the finite
control is in state q, then the input has been parsed to the
form xY_0...Y_{j-1} and that xAy is a sentential form for G_A.
Furthermore the first k symbols in y might be w.


To calculate look ahead strings we define a function
H_k(v), defined for any v in (N_A ∪ T_A)*. H_k(v) is the set
of all length k strings that are prefixes of strings
derivable from v. Thus

H_k(v) = {u | |u| = k and v =>* uz for some z ∈ (N_A ∪ T_A)*}.

Following Knuth, state sets are created by starting
with some initial set of items and computing the closure
of this set. The inclusion of new items is determined by
a closure function, C. For some item, (A -> y_1 ∧ y_2 Ap, w), define
C((A  ^  y^Ay2Ap,w))  =  {(B  ^  AuAr,w')  |  B  =  FIRST(y2,l), 

w’  eH^(y^w)  }  . 

Starting with the initial set of items, Q(q)', the state set
Q(q) is defined to be the smallest set satisfying

Q(q) = Q(q)' ∪ {(p,0,w) | (p,0,w) ∈ C((r,j,w')),
                (r,j,w') ∈ Q(q)}.
Once Q(q) is found, we can construct the transitions
from q. This will be done in two steps. For now, the look
ahead strings will not be used. Let us partition Q(q) by
defining the sets T(Ap) and T(Y) to be

T(Ap) = {(A -> y ∧ Ap, w) | (A -> y ∧ Ap, w) ∈ Q(q)}, and
T(Y) = {(A -> y_1 ∧ Y y_2 Ap, w) | (A -> y_1 ∧ Y y_2 Ap, w) ∈ Q(q)}.

If T(Y) is not empty, then q has a read transition for the
symbol Y to another state, q'. The initial state set of q'
will be

{(A -> y_1 Y ∧ y_2 Ap, w) | (A -> y_1 ∧ Y y_2 Ap, w) ∈ T(Y)}.

If  T(Ap)  is  not  empty,  then  q  will  have  a  reduce  transition, 
Ap,  to  a  special  state  q^.  q^  is  accessed  by  all  reduce 
transitions  and  has  no  transitions  itself.  If  Ap  refers  to  a 
production  with  an  empty  right  side  (A  eAp),  then  the  reduce 
transition,  Ap,  is  replaced  by  a  read  transition  for  the  empty 
string,  e.  Informally,  this  transition  allows  the  parser  to 
change  state  by  "reading"  e.  The  destination  of  the  read 
transition  is  a  new  special  state  with  the  single  reduce 
transition,  Ap,  to  q^.  These  special  read  transitions  will 
enable  us  to  treat  all  reduce  transitions  in  a  uniform  way. 

The  construction  process  continues  by  closing  the  state 
sets  for  any  new  states  that  were  created  above.  Knuth  has 
shown  that  this  algorithm  must  eventually  terminate 
(Knuth  1965)  . 

The construction of the parser begins by computing the
state set of the start state of the control. The initial
state set of the start state is {(S_A -> ∧ S⊥^k A0, e)}.

We shall call the parser, constructed by this process,
M(0).
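For k = 0 the construction of state sets and read transitions can be sketched compactly. The code below is the standard textbook LR(0) set-of-items construction and is offered only as an approximation of the process described above: it omits look ahead strings, the special state reached by reduce transitions, and the special treatment of empty right sides. Items are (p, j) pairs, production 0 is assumed to be the goal production, and all names are hypothetical.

```python
from collections import deque


def closure(items, productions):
    """Close a set of (p, j) items: if the marker precedes a symbol B,
    add (r, 0) for every production r whose left side is B."""
    result = set(items)
    work = list(items)
    while work:
        p, j = work.pop()
        rhs = productions[p][1]
        if j < len(rhs):
            for r, (lhs, _) in enumerate(productions):
                if lhs == rhs[j] and (r, 0) not in result:
                    result.add((r, 0))
                    work.append((r, 0))
    return frozenset(result)


def build_lr0(productions):
    """Return (states, transitions, reductions) for the LR(0) control."""
    start = closure({(0, 0)}, productions)
    states = {start: 0}          # state set -> state number
    transitions = {}             # (state number, symbol) -> state number
    reductions = {}              # state number -> set of production numbers
    queue = deque([start])
    while queue:
        state = queue.popleft()
        q = states[state]
        kernels = {}             # symbol -> initial state set of the target
        for p, j in state:
            rhs = productions[p][1]
            if j == len(rhs):
                reductions.setdefault(q, set()).add(p)
            else:
                kernels.setdefault(rhs[j], set()).add((p, j + 1))
        for symbol, kernel in kernels.items():
            target = closure(kernel, productions)
            if target not in states:
                states[target] = len(states)
                queue.append(target)
            transitions[(q, symbol)] = states[target]
    return states, transitions, reductions


def conflicting_states(transitions, reductions):
    """States with two reductions, or with a reduction and a read."""
    bad = set()
    for q, prods in reductions.items():
        has_read = any(src == q for (src, _) in transitions)
        if len(prods) > 1 or has_read:
            bad.add(q)
    return bad
```

A grammar for which conflicting_states returns a non-empty set cannot be handled by this k = 0 sketch without look ahead.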


So far we have not used the look ahead strings. Of
course, if k = 0, then all look ahead strings are empty and,
in this case, the construction of the parser is complete. If
k > 0, we will reconstruct the transitions of the parser. For
any symbol t (t is either in (N_A ∪ T_A) or is an apply symbol)
and any state q, we define Ψ_0(q,t) = {e}. If i > 0 and
t = Ap, then

Ψ_i(q,Ap) = {w' | w' = FIRST(w,i) and (A -> y ∧ Ap, w) ∈ T(Ap)}.

If i > 0 and t = Y ∈ (N_A ∪ T_A), then

Ψ_i(q,Y) = {Yw' | w' ∈ Ψ_{i-1}(q'',t'), M(0) has a read
            transition, Y, to q'' and t' ∈ (N_A ∪ T_A) or
            t' is any apply symbol}.

By definition, if T(t) is empty, then Ψ_i(q,t) = ∅ for all
i > 0.

If  t(Y)  is  not  empty,  then  q  has  a  read- an d-look- ahead 
trans  it  ion  to  state  q’  for  every  element  in  the  set  'l'j^(q,Y)  . 
q'  is  the  same  state  that  was  accessed  by  the  original  read 
transition,  Y.  Similarly,  if  T(Ap)  is  not  empty,  then  q  will 
have  a  1  ook  ahead  trans  i  tion  for  every  element  in  'l'^(q,Ap)  to 
a  new  state  with  the  single  reduce  transition,  Ap ,  to  q^. 
If the p-th production has an empty right side (A -> eAp), then
the transitions are read-and-look-ahead transitions for members
of the set

{ev | v ∈ Ψ_k(q,Ap)}.

These transitions are similar to read transitions for the
empty string. They also permit reductions to be handled in
a uniform way. This will become apparent when we discuss the
operation of the parser.

Knuth (Knuth 1965) determines read and look ahead
transitions directly from the state sets. For a given state
set, Q(q), if T(Y) is not empty, then q will have
read-and-look-ahead transitions for every element of the set

{Yz | z ∈ H_{k-1}(y_2 w) where (A -> y_1 ∧ Y y_2 Ap, w) ∈ T(Y)}.

Our two stage approach will result in exactly the same parser.

Another  difference  between  Knuth's  description  and 
ours  is  that  Knuth  requires  look  ahead  strings  to  be  in  • 

We have included look ahead strings that contain nonterminals.
As a result, our algorithm may create a look ahead transition,
v, or a read-and-look-ahead transition, Yv, such that v
contains some nonterminal symbols. In practice, an LR(k)
parser is used to parse strings of terminal symbols only. In
these circumstances, transitions such as the above will never
be used. Thus, Knuth's requirement may be viewed as an
optimization of the parser generator algorithm and of the
parser. A parser, constructed according to our version of
the LR(k) algorithm, will accept exactly the same set of
strings of terminal symbols as a parser constructed
according to Knuth's. Our description of the algorithm
will facilitate comparison with a new parser generator that
is presented in Chapter 6.

If a state, q, in the final parser, has a reduce
transition and any other transition or if q has two
transitions, w_1 and w_2, leading to different states and
w_1 ∈ PREFIX(w_2), then G is not LR(k). In particular, if
k = 0 and q has a read transition for the empty string, then
G is not LR(0).

LR(k)  Parser  Operation 

An LR(k) parse begins by emptying the stack and
placing the finite control in its start state. At any time
during the parse, the finite control is in a state, q, and
one of two cases holds.

(i) State q has read transitions (if k = 0) or read-and-
look-ahead or look ahead transitions (if k > 0).

Let us assume that the transitions from q are x_1,...,x_n.
The parser compares each of x_1,...,x_n to the first symbols in
the input. If none of them matches, then an error has been
found and the parse terminates. Otherwise the beginning
of the input matches a single transition, x_i, only. If
the transition x_i is a look ahead transition to state q',
the finite control simply changes its state to q'. If
x_i is a read transition then x_i = X ∈ (N_A ∪ T_A). By
construction, a read transition x_i = e is not possible in
an LR(k) parser. The pair (X,q) is pushed into the stack,
X is removed from the input and the control enters state
q'. Finally, x_i may be a read-and-look-ahead transition
Xt or et to q'. For Xt, (X,q) is pushed into the stack
and X is removed from the beginning of the input. For
et, (e,q) is pushed into the stack, but the input remains
unchanged. In both instances, the control then enters
state q'.

(ii) State q has a single reduce transition Ap.

Let the p-th production be A -> X_1...X_n Ap. By
construction, the top n symbols in the stack must be of
the form (X_1,q_1')...(X_n,q_n'). These symbols are popped
from the stack, A is placed at the front of the input,
and the control enters state q_1' (by construction, q_1'
will then read the symbol A). For productions with empty
right sides (A -> eAp), one symbol of the form (e,q_1') is
removed from the stack. A is placed at the front of the
input and the control enters state q_1'. The use of the
special read-and-look-ahead transitions of the form ew
should be clear. Had they not been used, the state q_1'
would not have been stacked.

This operation continues until an error has been
found or until the control enters the start state with an
empty stack and only the goal symbol, S_A, in the input. In
this latter situation, the input is accepted.

Among other interesting results, Knuth showed that

(i) an LR(k) parse will eventually halt for any input;

(ii) for any CFG, G, CS(G) is a regular set;

(iii) the set of languages recognizable by a deterministic
pushdown automaton (DPDA) is exactly the class of
languages that have an LR(1) grammar;

(iv) for a given k, there is an algorithm to determine
for any CFG, G, whether G is LR(k) (we have described
this algorithm);

(v) there is no algorithm to determine whether there
exists a k such that a CFG is LR(k).

If k is greater than zero, the algorithm we have
given often generates a very large number of states.
Techniques for reducing the number of states generated
by the LR(k) algorithm are discussed in (Aho and Ullman
1972b, Joliat 1973, Anderson et al 1973). De Remer has
taken a different approach (De Remer 1969). He noticed
that many state sets differ only in the look ahead
components of their items. Also, the look ahead information
is only necessary in certain states where it determines
whether a new symbol should be read or a reduction
should be made.

To  construct  an  LR(k)  parser  with  fewer  states 
De  Remer  starts  by  trying  to  generate  an  LR(0)  parser. 

If  look  ahead  is  needed,  it  is  computed  later  and  added  to 
the  basic  parser.  In  more  detail,  De  Remer  uses  Knuth's 
LR(0)  constructor  on  the  grammar  G.  The  resulting  finite 
state  control  is  called  the  characteristic  finite  state 
machine  (CFSM)  because  it  recognizes  the  characteristic 
strings of G. The CFSM may contain inadequate states,
states with both reduce and read transitions. These states
require look ahead. De Remer suggests several algorithms,
of increasing complexity, to compute look ahead transitions.
These include his SLR(k), LALR(k), and L(m)R(k) algorithms.

Korenjak  (Korenjak  1969)  has  also  suggested  a 
method  of  generating  LR(k)  parsers  of  a  more  practical 
size.  Aho  and  Ullman  (Aho  and  Ullman  1972b),  Anderson, 

Eve,  and  Horning  (Anderson  et  al  1973),  and  Joliat  (Joliat 
1973)  all  discuss  methods  of  reducing  the  size  of  the  parser 
once  it  has  been  constructed  by  one  of  the  above  algorithms. 
These  optimizing  techniques  will  also  be  applicable  to  the 
parsers  we  will  discuss  in  subsequent  sections. 
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The 


of  "Good"  Parsers 


LR(k)  parsers  possess  many  features  that  are 
considered  necessary  or  desirable  for  any  parser  that 
is  to  be  a  part  of  a  compiler.  Let  us  examine  these 
features  in  more  detail. 

(i) A fundamental requirement of any parser is
that it accept only source programs that are
defined to be correct or well formed. Furthermore,
for any input string, the parser algorithm
must eventually terminate, accepting or rejecting
the input.

Requirement (i) is really the only necessary
constraint on a recognizer. However there are other
characteristics that greatly enhance the quality of any parser.

(ii) A parsing algorithm should be deterministic and
require no backtracking. A deterministic simulation
of all moves of a basically nondeterministic parser,
that is, the use of backtracking, is usually too
inefficient for practical compilation.

(iii) Another efficiency constraint is that only a single,
one way scan through the input should be made. For
practical reasons, this pass should be made from
left to right; this is the usual sequence in which
the input is provided.

(iv) Errors should be detected as early as possible in
the parse. During the left to right scan of the
input, the parser should never scan past a point
at which it can be established that the input
contains an error.


LR(k)  parsers  have  all  of  these  properties. 


We  will  define  and  investigate  some  very  general 
classes  of  parsers  and  grammars  that  embody  some  or  all 


of  these  highly  desirable 


Two  Stack  Machines 


Our parser model, the two stack machine, will be
very similar to the informal model we used in Chapter I.

The finite control is always in one of several possible
states. Initially, the input is pushed into the R-STACK
(which acts as an output restricted deque) and the control
is in state q_1. The L-STACK is empty. At each step in
its operation, the two stack machine may either

(i) read - remove a symbol from the left end of the
R-STACK, push that symbol into the L-STACK and
change state;

(ii) look ahead - change state after looking at a
bounded number of symbols at the left end of
the R-STACK; or

(iii) reduce - pop some symbols from the L-STACK,
push some symbols into the left end of the
R-STACK, and change state.

After the input is placed in the R-STACK,
symbols can only be pushed into or popped from the left
end of the stack. This end will be called the top of the
R-STACK.
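The three kinds of step can be pictured with a small data structure. The sketch below is illustrative only: the class and method names are hypothetical, the caller supplies the next state rather than an instruction table, and the L-STACK stores (symbol, state) pairs in anticipation of the formal definition.

```python
from collections import deque


class TwoStackConfiguration:
    """L-STACK (top at the right end), current state, R-STACK (top at the left end)."""

    def __init__(self, input_symbols, start_state):
        self.l_stack = []
        self.r_stack = deque(input_symbols)
        self.state = start_state

    def read(self, next_state):
        """Move the top R-STACK symbol, paired with the current state, to the L-STACK."""
        symbol = self.r_stack.popleft()
        self.l_stack.append((symbol, self.state))
        self.state = next_state

    def look_ahead(self, expected, next_state):
        """Change state only if the top of the R-STACK matches `expected`."""
        top = list(self.r_stack)[:len(expected)]
        if top != list(expected):
            raise ValueError("look ahead does not match the R-STACK")
        self.state = next_state

    def reduce(self, pop_count, pushed, next_state):
        """Pop pairs from the L-STACK, push `pushed` onto the top of the R-STACK."""
        popped = [self.l_stack.pop() for _ in range(pop_count)]
        self.r_stack.extendleft(reversed(pushed))
        self.state = next_state
        return popped[::-1]

    def parse_string(self):
        """L-STACK contents, state, and R-STACK contents, in that order."""
        return (tuple(self.l_stack), self.state, tuple(self.r_stack))
```

A deque is used for the R-STACK because symbols enter once at the right when the input is loaded, and afterwards are pushed and popped only at the left end.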

i 

To give the two stack machine a formal definition,
we introduce an auxiliary concept, the two stack system.

A two stack system is a quintuple M = (K,V_L,V_R,T,I) where

(i) K is a finite non-empty set of states. q_1 is the
unique start state;

(ii) V_R is a finite set of symbols that may appear in
the R-STACK;

i 

(iii)  ^^L^^R  ~  finite  set  of  symbols  which 

may appear in the L-STACK. Each element of V_L is of
the form (X,q) with X in V_R and q in K. V_L may
also contain elements of the form (e,q) where e
is the empty string;

(iv) T ⊆ V_R is the terminal set of M. The distinction
between symbols in T and V_R - T is similar to that
between terminals and nonterminals in a grammar;
and

(v) I is a finite set of instructions. Elements of
I are of five types.

(a)  reduce  instruction  -  (X,q)yq»  or 

(e,q)q'  qx  and  X  e  V^,y  e  ,  x  e  V^+ . 

(b)  look  ahead  instruction  -  qx  ->  q’x,  x  e 

--  ■■  R 

(c)  read- an  d-look- ahead  ins  t  ruct  ion  “  qXy  (X,q)q'y, 

X  e  u  {e},'y  £  . 

(d)  read  instruction  -  qX  ->■  (X,q)q’,  XeV^u  {e}. 

(e)  accept  instruction  -  q^S  ACCEPT  . 

The concatenation of the L-STACK, followed by the
current state of M and then the contents of the R-STACK is
called a parse string. Thus, in a parse string xqy, the
contents of the L-STACK is the string x (the top of the
L-STACK is at the right end of x), the state of M is q,
and the R-STACK contains the string y (the top of the
R-STACK is the first symbol of y). Instructions may be
thought of as grammatical productions, in reverse, operating
on the parse string. If no instruction can be applied to
a parse string, then an error has been found.


A state, q, of M is classified according to the
types of instructions that may be used when M is in that
state. An instruction is applicable to state q if it is
of the form yq -> q'x or qy -> ... . If the only instructions
applicable to q are read, read-and-look-ahead, look ahead,
or accept instructions, then q is a read state. If only
reduce instructions are applicable to q, then q is a reduce
state. A state q is said to be inadequate under any of the
following conditions.

(i) Both reduce and non-reduce instructions are
applicable to q. In this case, q is neither
a read nor a reduce state.

(ii) q is a read state with the following instructions

q t_1 -> ...
  ...
q t_m -> ...

and, for some i ≠ j, t_i ∈ PREFIX(t_j).

(iii)  q  is  a  reduce  state  with  at  least  two  reduce 
instructions 

(^l>*ll  ^  ^1*^1 

and 

such  that  x-  ^  x„  or  .  .  .Y  X,  .  .  .X  . 

1  2  1  n  1  m 
A state that is not inadequate is said to be adequate.

The notion of inadequacy is a straightforward extension of
De Remer's concept of inadequacy (De Remer 1969). The
possibility of inadequate states means that, in general,
a two stack system is a nondeterministic device.
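The classification of states can be sketched as a small check over an instruction list. This sketch covers conditions (i) and (ii) only; condition (iii), on competing reduce instructions, is not implemented. Each instruction is reduced to a hypothetical triple (kind, state, t), where t is the string matched at the top of the R-STACK, and PREFIX is assumed to contain every non-empty prefix of a string, including the string itself.

```python
def classify(state, instructions):
    """Return "read", "reduce", "inadequate", or "none" for `state`.

    instructions: iterable of (kind, state, t) triples, where kind is one of
    "read", "read_and_look_ahead", "look_ahead", "accept" or "reduce".
    """
    applicable = [(kind, t) for kind, q, t in instructions if q == state]
    if not applicable:
        return "none"
    kinds = {kind for kind, _ in applicable}
    has_reduce = "reduce" in kinds
    has_other = bool(kinds - {"reduce"})
    # Condition (i): reduce and non-reduce instructions both apply.
    if has_reduce and has_other:
        return "inadequate"
    if has_reduce:
        # Condition (iii) is not checked in this sketch.
        return "reduce"
    # Condition (ii): some t_i is a non-empty prefix of another t_j.
    strings = [t for _, t in applicable]
    for i, ti in enumerate(strings):
        for j, tj in enumerate(strings):
            if i != j and ti and tj[:len(ti)] == ti:
                return "inadequate"
    return "read"
```

A system in which classify never returns "inadequate" may still be nondeterministic through condition (iii), so the result should be read as a necessary check rather than a sufficient one.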

It will be convenient to define a relation |-_M between
parse strings. Informally, if xqy is a parse string and there
is an instruction to change this to x'q'y', then we write

        xqy |-_M x'q'y'

or xqy |- x'q'y' if M is understood. In particular, for each
type of instruction, |-_M is defined as follows.

Let u be in V_L* and v in V_R*.

(i)    (X,q)yq' -> qx, a reduce instruction.

        u(X,q)yq'v |- uqxv

(ii)   (ε,q)q' -> qx, a reduce instruction.

        u(ε,q)q'v |- uqxv

(iii)  q'x -> qx, a look ahead instruction.

        uq'xv |- uqxv

(iv)   q'Xy -> (X,q')qy, a read-and-look-ahead instruction.

        uq'Xyv |- u(X,q')qyv

(v)    q'y -> (ε,q')qy, a read-and-look-ahead instruction.

        uq'yv |- u(ε,q')qyv

(vi)   q'X -> (X,q')q, a read instruction.

        uq'Xv |- u(X,q')qv

(vii)  q' -> (ε,q')q, a read instruction.

        uq'v |- u(ε,q')qv

(viii) q_1 S -> ACCEPT, an accept instruction.

        u q_1 S v |- u ACCEPT v

If there exists an instruction q'Xt -> (X,q')qt, we say
that q is accessed by X.

A two stack system, M, is a two stack machine (2SM) if and
only if it meets the following five constraints.

(i)   All states are accessed by a unique symbol. Thus if
      there are instructions q'Xt -> (X,q')qt and
      q'Yt' -> (Y,q')qt', then X = Y. Both t and t' may be
      the empty string.

(ii)  Let qXt_1 -> (X,q)q't_1 and qYt_2 -> (Y,q)q"t_2 be any
      two read or read-and-look-ahead instructions (if there
      are two) applicable to state q. If X ≠ ε and X = Y,
      then q' = q".

(iii) A reduce instruction of the form (ε,q')q -> q'x
      corresponds to the notion of productions in a
      type-0 grammar (TOG) of the form x -> ε. A state q
      appears on the right side of an instruction of the
      form

              q't -> (ε,q')qt,  (t ∈ V_R*)

      if and only if the only instructions applicable to
      q are of the form (ε,q')q -> q'x.

(iv)  If qx -> q'x is a look ahead instruction, then q'
      must be a reduce state.

(v)   For any state q, there is a possibly empty series
      of read, read-and-look-ahead, or look ahead
      instructions so that for some x, y and z

              xqyz |-* xy'q'z

      and there is a reduce instruction applicable to
      q'. Moreover, if there exists a reduce instruction
      (Y_1,q_1')...(Y_n,q_n')q -> q_1'x, then there are read,
      read-and-look-ahead, or look ahead instructions so
      that

              uq_1'Y_1...Y_n z |-* u(Y_1,q_1')...(Y_n,q_n')qz
                               |-  uq_1'xz.

      The converse must also hold. If

              uq_1'Y_1...Y_n z |-* u(Y_1,q_1')...(Y_n,q_n')qz

      and q has a reduce instruction

              (Y_1,q_1")...(Y_n,q_n")q -> q_1"x

      then the reduce instruction

              (Y_1,q_1')...(Y_n,q_n')q -> q_1'x

      exists.

      This constraint implies that some state with a
      reduce instruction can be reached from any state
      and that there are no "useless" reduce instructions.

The set of strings accepted by M is

        T(M) = {w ∈ V_R* | q_1 w |-* ACCEPT}.

Often we are more interested in the language accepted by M,

        L(M) = {w | w ∈ T(M) and w ∈ T*}.

The difference between the two sets is analogous to the
difference between the set of all sentential forms related
to a grammar and the grammar's language.

Our  two  stack  machines  are  similar  to  those  used 
by  Griffiths  and  Patrick  (Griffiths  and  Patrick  1965) 
and  by  Colmerauer  (Colmerauer  1970)  .  Their  two  stack 
parsers  are  also  controlled  by  instructions,  but  the 
control  itself  is  not  explicitly  mentioned.  Thus  there 
is  no  notion  of  state  in  their  parsers.  In  general,  there 
is  no  restriction  on  the  form  of  instructions  as  there  is 
in  our  model.  Our  2SM  definition  was  developed  as  the 
formalization  of  a  two  stack  parser  run  by  a  finite 
state  machine  control.  The  operation,  capabilities, 
and  structure  of  this  parser  are  straightforward 
generalizations  of  the  LR(k)  parser  case.  In  fact,  we 
shall  see  that  LR(k)  parsers  are  simply  special  kinds 
of  two  stack  machine. 

It should be clear that the set of state strings,

        C' = {zq | q_1 w |-* zqz' for some w ∈ V_R*},

is a regular set. An interesting subset of C' is the set of
state strings that end in a reduce state,

        C = {zq | zq ∈ C' and q is a reduce state}.

C is also a regular set. This fact suggests the alternate
visualization of M's finite control as a finite state
machine (FSM) (Hopcroft and Ullman 1969), with states
in K.


To convert a 2SM description to an FSM description,
we first number the reductions in some arbitrary order.
For the purposes of identification, two reduce
instructions (Y_1,q_1')...(Y_n,q_n')q -> q_1'x_1 and
(X_1,q_1")...(X_m,q_m")q -> q_1"x_2 refer to different
reductions if and only if Y_1...Y_n ≠ X_1...X_m or x_1 ≠ x_2.

The transitions of the FSM are constructed from
the instructions as follows:

(i)   qt -> q't - q has a transition, t, to q'.

(ii)  qX -> (X,q)q' - q has a transition, X, to q'.

(iii) qXt -> (X,q)q't - q has a transition, Xt, to
      q'. X may be the empty string.

(iv)  yq -> q'x - If this refers to the r-th reduction,
      then q has an Ar transition to a new state, which
      has no transitions itself. Often a reduce
      transition corresponds to more than one reduce
      instruction.

The types of the transitions are named after the types of
the instructions from which they are derived.

The FSM control operates as follows:

(i)   when following a read transition for the symbol X
      from state q (X = ε is possible), push the pair
      (X,q) into the L-STACK. If X ≠ ε, remove X from the
      top of the R-STACK. Finally, change state;

(ii)  for a read-and-look-ahead transition, Xt, from q
      (that is, the top |Xt| symbols of the R-STACK are Xt),
      push the pair (X,q) into the L-STACK. If X ≠ ε,
      remove X from the top of the R-STACK. Finally, change
      state;

(iii) if a state q has a look ahead transition, t, and the
      top |t| symbols in the R-STACK form the string t, then
      change state. The R-STACK and the L-STACK are unchanged;

(iv)  if one of the instructions for the r-th reduction
      is yq -> q'x, then taking the reduce transition Ar
      involves the following operations. Pop the first
      |y| symbols from the L-STACK and push x into the
      R-STACK. If the last symbol popped from the L-STACK
      is (X,q'), then transfer to state q';

(v)   if the accept instruction was q_1 S -> ACCEPT, then
      if the FSM is in state q_1 and S is the only symbol in
      the R-STACK, accept the input;

(vi)  if none of the above apply, an error has been found.
      The input is rejected and the parser halts.


An FSM model for the finite control can also be
converted into a formal 2SM description. We shall use the
two formulations, and their corresponding terminology,
interchangeably.

An example should illustrate the equivalence of the
two descriptions. Consider the following CSG (context
sensitive grammar) G = ({S,A,B}, {a,b,⊥}, P, S) with the
productions

        S  -> A⊥   A0
        A  -> aA   A1
        aA -> bA   A2
        A  -> B    A3
        B  -> ab   A4.

A parser for L(G), taking the finite state machine approach,
is illustrated by the state diagram

[State diagram not reproduced.]

(i)   When reading a symbol, the FSM control removes
      it from the top of the R-STACK, places it along with
      the current state in the L-STACK, and changes state.

      The transition from state 8 marked, {⊥}, is a look
      ahead transition. If the control is in state 8 and
      the top symbol in the R-STACK is ⊥, then change the
      state to 10. The stacks are not affected.

(ii)  For  a  reduce  transition,  the  FSM  control  performs 
the  following  actions: 

A0:  The L-STACK can only contain (A,1)(⊥,2).
     These two symbols are removed from the L-STACK
     and S is pushed into the R-STACK. The control
     is placed in state 1.

A1:  The L-STACK may contain (a,1)(A,3), (a,8)(A,3),
     (a,3)(A,3), or (a,4)(A,3). Two symbols are
     popped from the L-STACK and A is pushed into the
     R-STACK. The control enters the state that occurs
     in the last symbol popped. This state will be
     1, 8, 3, or 4.

A2:  The L-STACK may contain (b,3)(A,8), (b,1)(A,4),
     (b,4)(A,4), or (b,8)(A,4). Two symbols are
     popped from the L-STACK, and A and then a
     are pushed into the top of the R-STACK.
     The control enters the state that occurs in the
     last symbol popped. This state will be 3, 1, 4, or 8.

A3:  The L-STACK contains (B,1), (B,8), (B,4),
     or (B,3). One symbol is popped from the L-STACK
     and an A is pushed into the R-STACK. The control
     enters either state 1, 8, 4, or 3 depending on the
     symbol popped from the L-STACK.

A4:  The L-STACK contains (a,1)(b,3), (a,8)(b,3),
     (a,3)(b,3), or (a,4)(b,3). Two symbols are
     popped from the L-STACK, and B is pushed into
     the R-STACK. The control enters state 1, 8, 3,
     or 4.

(iii)  If  no  move  is  possible  an  error  has  been  found  and 
the  parse  terminates. 

(iv)  The  start  state  is  1.  If  the  parser  is  in  state  1 
and  the  only  symbol  in  the  R-STACK  is  S,  then  the 
input  is  accepted. 
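
The operation just described can be traced with a short simulation. This is a sketch under stated assumptions: the tables READ, LOOK_AHEAD and REDUCE are transcribed from the reduce descriptions A0 to A4 and the look ahead transition of state 8 as best they can be read, the character "$" stands in for the endmarker, and the names parse and max_steps are hypothetical.

```python
END = "$"  # stands in for the endmarker (an assumption of this sketch)

# (state, symbol read) -> next state; transcribed from the example.
READ = {
    (1, "A"): 2, (1, "a"): 3, (1, "b"): 4, (1, "B"): 5,
    (2, END): 6,
    (3, "a"): 3, (3, "b"): 8, (3, "B"): 5, (3, "A"): 7,
    (4, "a"): 3, (4, "b"): 4, (4, "A"): 9, (4, "B"): 5,
    (8, "A"): 9, (8, "a"): 3, (8, "b"): 4, (8, "B"): 5,
}
# (state, symbol on top of the R-STACK) -> next state; stacks unchanged.
LOOK_AHEAD = {(8, END): 10}
# reduce state -> (number of L-STACK symbols popped, string pushed)
REDUCE = {5: (1, "A"), 6: (2, "S"), 7: (2, "A"), 9: (2, "aA"), 10: (2, "B")}


def parse(text, max_steps=10_000):
    """Return True if the example parser accepts text, False on an error."""
    r_stack = list(reversed(text))  # top of the R-STACK is the list's end
    l_stack = []                    # holds (symbol, state) pairs
    state = 1
    for _ in range(max_steps):
        if state == 1 and r_stack == ["S"]:
            return True             # accept: state 1, only S in the R-STACK
        top = r_stack[-1] if r_stack else None
        if state in REDUCE:
            count, pushed = REDUCE[state]
            popped = [l_stack.pop() for _ in range(count)]
            state = popped[-1][1]   # state named in the last symbol popped
            r_stack.extend(reversed(pushed))  # first symbol ends up on top
        elif (state, top) in LOOK_AHEAD:
            state = LOOK_AHEAD[(state, top)]
        elif (state, top) in READ:
            l_stack.append((r_stack.pop(), state))
            state = READ[(state, top)]
        else:
            return False            # no move is possible: error
    raise RuntimeError("step limit reached")
```

Calling parse("bab$") exercises the reduction A2, which pops two symbols and pushes the two-symbol string aA back into the R-STACK before parsing continues from state 1.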


The instructions of the formal 2SM specification of
this parser are:

M:    q_1 S -> ACCEPT
      q_1 A -> (A,q_1)q_2
      q_1 a -> (a,q_1)q_3
      q_1 b -> (b,q_1)q_4
      q_1 B -> (B,q_1)q_5

      q_2 ⊥ -> (⊥,q_2)q_6

      q_3 a -> (a,q_3)q_3
      q_3 b -> (b,q_3)q_8
      q_3 B -> (B,q_3)q_5
      q_3 A -> (A,q_3)q_7

      q_4 a -> (a,q_4)q_3
      q_4 b -> (b,q_4)q_4
      q_4 A -> (A,q_4)q_9
      q_4 B -> (B,q_4)q_5

      (B,q_1)q_5 -> q_1 A
      (B,q_3)q_5 -> q_3 A
      (B,q_4)q_5 -> q_4 A
      (B,q_8)q_5 -> q_8 A

      (A,q_1)(⊥,q_2)q_6 -> q_1 S

      (a,q_1)(A,q_3)q_7 -> q_1 A
      (a,q_8)(A,q_3)q_7 -> q_8 A
      (a,q_3)(A,q_3)q_7 -> q_3 A
      (a,q_4)(A,q_3)q_7 -> q_4 A

      q_8 ⊥ -> q_10 ⊥
      q_8 A -> (A,q_8)q_9
      q_8 a -> (a,q_8)q_3
      q_8 b -> (b,q_8)q_4
      q_8 B -> (B,q_8)q_5

      (b,q_3)(A,q_8)q_9 -> q_3 aA
      (b,q_1)(A,q_4)q_9 -> q_1 aA
      (b,q_4)(A,q_4)q_9 -> q_4 aA
      (b,q_8)(A,q_4)q_9 -> q_8 aA

      (a,q_1)(b,q_3)q_10 -> q_1 B
      (a,q_8)(b,q_3)q_10 -> q_8 B
      (a,q_3)(b,q_3)q_10 -> q_3 B
      (a,q_4)(b,q_3)q_10 -> q_4 B

This particular 2SM does not contain any
inadequate states. Since, in any state, there is at most
one instruction that may be applied at any time, M is
deterministic in its operation. In general, 2SM's are
nondeterministic devices. The next section investigates
general nondeterministic 2SM's. The remainder of the
thesis will examine the more interesting case of
deterministic 2SM's.


Two  Stack  Machines  and  Type-0  Languages 

Clearly, any 2SM can be simulated by a two pushdown
tape machine (Hopcroft and Ullman 1969). Therefore,
we state without proof

Theorem 3.1

If L is the language accepted by a 2SM, then L is
a TOL. □


The  converse  of  this  theorem  is  also  true.  We 
shall  establish  this  result  by  describing  an  algorithm 
that  will  construct  a  2SM  from  any  TOG.  Several  other 
useful  results  will  be  developed  on  the  way. 


Consider any TOG, G = (N,T,P,S). The concept of
characteristic strings is extended to all TOG's so that,
for G,

        CS(G) = {wAp | S =>* x_1 u x_2 => x_1 v x_2, x_1 v = w, and
                 u -> vAp ∈ P}.

From G, we can construct the containing CFG,
G_C = (N_C, T_C, P_C, S) where P_C = {X -> yAp | Xz -> yAp ∈ P},
N_C = {X | X -> yAp ∈ P_C}, and T_C = (N ∪ T) - N_C. An important
feature of G_C is that its characteristic strings are a
superset of CS(G).
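
The construction keeps only the first symbol of each left side. A minimal sketch follows, assuming productions are stored as pairs of symbol tuples with the list index standing for the apply symbol, and using "$" in place of the endmarker; the names containing_cfg and EXAMPLE are hypothetical.

```python
def containing_cfg(productions, symbols):
    """Return (N_C, T_C, P_C) for a type-0 grammar.

    productions: list of (left, right) pairs of symbol tuples; the index
                 of a pair in the list plays the role of its apply symbol.
    symbols:     the set of all symbols of the grammar (N union T).
    """
    # P_C: truncate every left side Xz to its first symbol X.
    p_c = [((left[0],), right) for left, right in productions]
    # N_C: every symbol that is the left side of some production in P_C.
    n_c = {left[0] for left, _ in p_c}
    # T_C: all remaining symbols.
    t_c = set(symbols) - n_c
    return n_c, t_c, p_c


# The example grammar of this chapter, with "$" standing for the endmarker.
EXAMPLE = [
    (("S",), ("A", "$")),
    (("A",), ("a", "A")),
    (("a", "A"), ("b", "A")),
    (("A",), ("B",)),
    (("B",), ("a", "b")),
]
N_C, T_C, P_C = containing_cfg(EXAMPLE, {"S", "A", "B", "a", "b", "$"})
```

For the example grammar the production aA -> bA becomes a -> bA, so the symbol a, a terminal of G, is a nonterminal of the containing CFG.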


Lemma 3.1

Let G = (N,T,P,S) be any TOG and let G_C be the
containing CFG constructed from G. CS(G) is a subset of
CS(G_C).

Proof - by induction on the number of steps in a derivation.

(i)   If S => w, then S -> wAp ∈ P. Thus S -> wAp ∈ P_C and
      S => w in G_C, and wAp ∈ CS(G) and wAp ∈ CS(G_C).

(ii)  Assume that, for some integer, n, and any w ∈ (N ∪ T)*, if

              S =>^(n-1) x_1 u x_2 => x_1 v x_2 = w, using u -> vAp,

      then we have x_1 vAp in CS(G) and x_1 vAp in CS(G_C).

      Consider a string, z, derived in n+1 steps:

              S =>^(n-1) x_1 u x_2 => x_1 v x_2 = y_1 t y_2 => y_1 s y_2 = z

      where u -> vAp and t -> sAq are the last two productions
      used in the derivation. By definition y_1 sAq ∈ CS(G) and,
      by inductive hypothesis, x_1 vAp ∈ CS(G_C). Therefore

              S =>* x_1 v x_2'   in G_C.

      Since the derivation of z is canonical, the first symbol
      in t, call it T, must occur somewhere in the string x_1 v.
      Thus

              S =>* x_1 v x_2' = y_1 T y_2' => y_1 s y_2'   in G_C

      and y_1 sAq ∈ CS(G_C).

Thus we have shown that if wAp ∈ CS(G), then
wAp ∈ CS(G_C) and so CS(G) ⊆ CS(G_C), as required. □

For any CFG, G, G_C is identical to G and so
CS(G) = CS(G_C). Knuth (Knuth 1965) showed that CS(G) is
always a regular set. If a grammar, G, is not a CFG, then
CS(G) and CS(G_C) are not necessarily equal and CS(G) may
not be regular. Examples are given in Chapter 5 where we
discuss parsers based on CS(G).

It is clear that we can extend the definition
of the k-augmented grammar to all type-0 grammars. The
use of the augmented grammar has two advantages.

(i)   If k > 0, then all strings that we parse will have
      an endmarker, ⊥. A parser will always know when it
      has found the end of the input.

(ii)  Because the goal production, S_A -> S⊥^k A0, is unique
      and S_A appears only in this production, a parser can
      interpret a reduction for this rule as a signal
      to accept the input.


We can also construct a containing context free
grammar G_AC from G_A. Since G_AC is a CFG, CS(G_AC) is a
regular set. From Lemma 3.1, CS(G_A) ⊆ CS(G_AC). Consider
any regular set R such that CS(G_A) ⊆ R ⊆ CS(G_AC). A finite
state machine control for a 2SM can be constructed from R
in much the same way that De Remer constructs and uses
his characteristic finite state machines (De Remer 1969).

Since the actual construction is so similar to the
LR(0) case, we shall only give an informal outline below.

(i)   We construct a finite state machine that recognizes
      R. This FSM must be deterministic. It will be
      structured so that each state is accessed by a
      unique symbol (this is always possible). The
      destination of all apply symbol transitions is a
      single state, which may be accessed by more than one
      symbol but will have no transitions itself.

(ii)  Let u -> εAp be any production with an empty
      right side and let q be any state with an Ap
      transition. This transition is replaced by a
      transition on the empty string to a new state. This
      new state has the single transition, Ap. This
      transformation is done in all states of this
      kind. Reduce transitions in this FSM are
      interpreted to apply to the corresponding productions
      in G_A. The notation M(G,R) will refer to the
      2SM constructed in this way.

M(G,R) possesses many interesting properties. In
particular, we can show that M(G,R) accepts an input if
and only if that input is a sentential form of G_A. Thus,
if the terminal set of M(G,R) is chosen to be the set of
terminals of G_A, then the language accepted by M(G,R) is
exactly L(G_A). First we need some notation. Let
M be any 2SM and let (X,q) be any symbol in V_L. If we
are not interested in the state name, we can
write P(X) for this symbol. P(X) is the "pushed form"
of X. This notation can be extended to strings as
follows:

        P(Xy) = P(X)P(y) for all X ∈ V_R, y ∈ V_R*.

By convention, P(ε) = ε.

Lemma 3.2

For any TOG, G, let M(G,R) be the 2SM constructed
from the regular set R, CS(G_A) ⊆ R ⊆ CS(G_AC), where G_A
is the k-augmented grammar derived from G. If S_A =>* w,
then w is in T(M(G,R)).


Proof - by induction on the number of steps in the
derivation of w.

(i)   S_A => w. In this case, w = S⊥^k and S_A -> S⊥^k A0 is
      the production used in the derivation. Thus
      S⊥^k A0 ∈ CS(G_A) and S⊥^k A0 ∈ R. M(G,R) has the accept
      instruction q_1 S_A -> ACCEPT and so

              q_1 S⊥^k |-+ P(S⊥^k)q |- q_1 S_A |- ACCEPT

      where q has a reduce transition corresponding
      to the production S_A -> S⊥^k A0.

(ii)  Assume that if S_A =>^n y, then q_1 y |-* ACCEPT (that is,
      y is accepted). Consider a string, z, derived in
      n+1 steps.

              S_A =>^n w = x_1 u x_2 => x_1 v x_2 = z, using u -> vAp.

      Since x_1 vAp is in CS(G_A) it is also an element of R.
      Hence

              q_1 z |-+ P(x_1 v)q x_2 |- P(x_1)q_2 u x_2

      or

              q_1 z |-* P(x_1)(ε,q_2)q x_2 |- P(x_1)q_2 u x_2   if v = ε.

      This portion of the parse uses read instructions only.
      q has a reduce instruction corresponding to the
      production u -> vAp.

      The string w = x_1 u x_2 was derived in n steps.
      Thus, by inductive hypothesis, q_1 w |-* ACCEPT. Since the
      derivation of z is canonical, it can be written as

              y_1 t y_2 => y_1 s y_2 = x_1 u x_2 => z

      and y_1 s is not a prefix of x_1. Thus

              q_1 w |-* P(y_1 s)q'y_2 |- P(y_1)q_3 t y_2 |-* ACCEPT

      or

              q_1 w |-* P(y_1)(ε,q_3)q'y_2 |- P(y_1)q_3 t y_2 |-* ACCEPT   if s = ε.

      The string x_1 is a proper prefix of y_1 s, so

              q_1 z |-+ P(x_1 v)q x_2 |- P(x_1)q_2 u x_2 |-* P(y_1 s)q'y_2

      where q' is accessed from q_2 by read instructions only.
      The cases for v = ε or s = ε are entirely similar. Thus

              q_1 z |-* ACCEPT

      as required. □

Note that the parse of z, in part (ii) of the
above proof, represented a canonical derivation in reverse,
of z from S_A. The following corollary is now self
evident.

Corollary

For any canonical derivation of a string w, S_A =>* w,
there exists a parse of w by M(G,R) that simulates that
derivation in reverse. □

The next lemma establishes the converse of Lemma 3.2.

Lemma 3.3

For any TOG, G, let M(G,R) be the 2SM constructed
from the regular set R, CS(G_A) ⊆ R ⊆ CS(G_AC), where G_A
is the k-augmented grammar derived from G. If w ∈ T(M(G,R)),
then S_A =>* w.

Proof - by induction on the number of instructions applied
in parsing w.

(i)   If q_1 w |- ACCEPT, q_1 w -> ACCEPT is the accept
      instruction. Thus w = S_A is the goal symbol of G_A and
      S_A =>* S_A as required.

(ii)  Assume that for all m ≤ n, if q_1 w |-^m ACCEPT, then
      S_A =>* w.

      Consider a string z accepted after the use of
      n+1 instructions. That is

              q_1 z |-^(n+1) ACCEPT.

      Let us assume that the parse of z can be written

              q_1 z |-* P(y_1 v)q y_2 |- P(y_1)q'u y_2 |-^m ACCEPT,

      where q is the first state in which a reduce instruction
      is applied. The input y_1 u y_2 is accepted by M(G,R) and
      fewer than n+1 instructions are applied in the process.
      Thus, by inductive hypothesis, S_A =>* y_1 u y_2. The reduce
      instruction applied in q corresponds to the production
      u -> vAp. Consequently,

              S_A =>* y_1 u y_2 => y_1 v y_2 = z.

      If v = ε, then we have

              q_1 z |-* P(y_1)(ε,q')q y_2 |- P(y_1)q'u y_2 |-* ACCEPT.

      Once again

              S_A =>* y_1 u y_2 => y_1 y_2 = z. □

Combining the last two lemmas, we obtain

Theorem 3.2

For any TOG, G, let M(G,R) be the 2SM constructed
from the regular set R, CS(G_A) ⊆ R ⊆ CS(G_AC), where G_A
is the k-augmented grammar derived from G. M(G,R) accepts
an input w if and only if w is a sentential form of G_A.

Proof

The proof follows immediately from the previous
two lemmas. □

This result can be generalized. Let us define
the restricted characteristic strings of a grammar G to
be

        RCS(G) = {wAp | wAp ∈ CS(G) and
                  S =>* u_1 y u_2 => u_1 v u_2 =>* z,
                  u_1 v = w, z ∈ T* and y -> vAp is a production in G}.

By definition, RCS(G) ⊆ CS(G). In a CFG in which all
nonterminal symbols can derive a terminal string, it is
known that RCS(G) = CS(G) (De Remer 1969). For a general
TOG, even where all productions can be used in the
derivation of a terminal string, the two sets are not
necessarily equal.


A simple application of the reasoning in the previous
lemmas will prove the following theorem.

Theorem 3.3

Let G = (N,T,P,S) be a TOG and let G_A = (N_A, T_A, P_A, S_A)
be the k-augmented grammar derived from it. Let R be a
regular set such that RCS(G_A) ⊆ R ⊆ CS(G_AC), and let M(G,R)
be the 2SM constructed from R.

If T_A = T ∪ {⊥} is chosen to be the terminal set of
M(G,R), then the language accepted by M(G,R) is exactly
L(G_A) = L(G){⊥^k}. □

Since G_AC is a CFG, we may apply Knuth's LR(0)
constructor to compute the regular set CS(G_AC). Thus
for any TOG, G, we have the following parser generator
algorithm:

(i)   construct the 0-augmented grammar, G_A, for G;

(ii)  use Knuth's LR(0) algorithm to find the
      set CS(G_AC);

(iii) the 2SM M(G,CS(G_AC)) is a parser that accepts the
      language L(G) = L(G_A).

Theorem 3.4

Let L be any TOL. If G is a grammar that generates
L and G_A is the 0-augmented grammar derived from G, then
the 2SM M(G,CS(G_AC)) accepts exactly the language L.
Furthermore, this 2SM is effectively constructable from G.

Proof

The construction described above establishes this
result. □

The parsers generated from CS(G_AC) may have two
problems. Firstly, they may contain inadequate states.
When M(G,CS(G_AC)) is in an inadequate state, there are at
least two instructions that may be used. Thus the parser is
nondeterministic. Secondly, the parser may not halt for
some inputs.

In  the  next  section  we  discuss  two  methods  of 
dealing  with  inadequate  states.  The  halting  problem  is 
dealt  with  in  the  next  chapter. 


Look  Ahead  and  Determinism 

If a 2SM has no inadequate states, it must operate
deterministically. At most one instruction may be applied
when the machine is in any given state. A 2SM with no
inadequate states is called a deterministic 2SM (D2SM).

Knuth (Knuth 1965) has shown that there are context
free languages that do not have an LR(k) grammar for any k.
The parser generator we described in the last section
corresponds to the LR(0) algorithm if the original grammar
is context free. Consequently, not all the 2SM's that are
constructed from CS(G_AC) will be deterministic. We can
try to make these parsers deterministic by adding look ahead.

We propose two means of calculating these look aheads
analogous to De Remer's SLR(k) and LALR(k) algorithms. The
aim of a look ahead algorithm is to compute, for each transition
of an inadequate state, a set of finite length strings that
could appear at the top of the R-STACK when, during the parse of
some input, the parser has arrived in this state and ought
to follow this transition. If all look ahead strings are
of length k, then the minimum requirement of the look ahead
set for a transition is that it include all length k strings
that could be found at the top of the R-STACK if this
transition is taken and the input is accepted.

In more detail, let G be any TOG, let G_A be the
k-augmented grammar derived from G, and let G_AC be the
containing CFG constructed from G_A. If R is a regular set
such that RCS(G_A) ⊆ R ⊆ CS(G_AC), then M(G,R) is the 2SM
constructed from R. In general, M(G,R) will contain
inadequate states.

Assume q is an inadequate state and that look ahead sets
have been found (in some unspecified manner) for each
transition from q. Let us number the transitions from
q. State q may have more than one read transition for
the empty string. The destination of each such transition
is a corresponding adequate reduce state. Each of these
transitions is given a different number. The look ahead
set for the t-th transition is L(q,t), a finite set of
strings. L(q,t) is a valid look ahead set if and only if
for all state strings xq and all w ∈ T_A* such that

        q_1 w |-* xqz |- x'q'z' |-* ACCEPT   (all moves in M(G,R)),

where the first instruction in the parse xqz |-* ACCEPT
corresponds to the t-th transition, there is a string s ∈ L(q,t)
such that s ∈ PREFIX(z). A look ahead algorithm that computes only
valid look ahead sets is also said to be valid. Valid
look ahead sets may (and often will) contain additional
look ahead strings.

We will assume that in any look ahead set L(q,t)
there will never be two elements s_1 and s_2 such that
s_1 ∈ PREFIX(s_2). Informally, if there are two such strings,
then s_1 "covers" s_2 and s_2 may be deleted. Whenever
the parser could have looked ahead to see s_2, s_1 will
also be a look ahead. Thus, no information is lost by
deleting s_2.
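
Deleting covered strings is a simple filter. The sketch below assumes each look ahead string is a Python str with one character per symbol; the name prune_covered is hypothetical.

```python
def prune_covered(look_aheads):
    """Drop every string that has a different member as a prefix."""
    strings = set(look_aheads)
    return {
        s for s in strings
        # s is covered if some other string t in the set is a prefix of s.
        if not any(t != s and s.startswith(t) for t in strings)
    }
```

For instance, in the set {a, ab, b, ba, cd} the strings ab and ba are covered by a and b respectively and are removed.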

An inadequate state, q, has been resolved if and
only if for any two transitions, t_1 and t_2, from q, there
are not strings s_1 ∈ L(q,t_1) and s_2 ∈ L(q,t_2) such that
s_1 ∈ PREFIX(s_2). The information in a look ahead set
L(q,t) can be imbedded in the 2SM, M(G,R), as follows:

(i)   if t refers to a read transition, X (X ≠ ε),
      to q', then replace this transition with a
      read-and-look-ahead transition, s, for every
      s in L(q,t). By definition, FIRST(s,1) = X;

(ii)  if t refers to a read transition, ε, to a
      state q', then q' is an adequate reduce state.
      This particular ε transition is replaced by
      read-and-look-ahead transitions, εv, to q'
      for every string, v, in L(q,t);

(iii) if t refers to a reduce transition, Ap, then
      replace it with a look ahead transition, s,
      for every s in L(q,t). Each of these transitions
      goes to a single new state which itself has a
      single transition, Ap.

Clearly, if all inadequate states are resolved, the
resultant 2SM is deterministic.
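
The resolution test stated before the list compares the look ahead sets of every ordered pair of distinct transitions. A minimal sketch, assuming the sets are given as a list of Python sets of strings, one per numbered transition of the state (the name is_resolved is hypothetical):

```python
def is_resolved(look_ahead_sets):
    """Return True if no look ahead string of one transition is a prefix
    of a look ahead string of a different transition."""
    for i, first in enumerate(look_ahead_sets):
        for j, second in enumerate(look_ahead_sets):
            if i == j:
                continue
            # s1 in PREFIX(s2) for some s1 of transition i, s2 of transition j
            if any(s2.startswith(s1) for s1 in first for s2 in second):
                return False
    return True
```

Because both orders of each pair are examined, the sets {ab} and {a} fail the test regardless of which transition is numbered first.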

From our discussion, we can see that the use of
valid look ahead sets means that the new 2SM accepts
exactly the same language that M(G,R) does. Furthermore,
the corollary to Lemma 3.2 is still valid. For every
canonical derivation of a string w, there will be a parse
of w, by the new 2SM, that corresponds to that derivation
in reverse.

In the next two sections, we consider look ahead
algorithms that may be applied to any 2SM, M(G,R).

Simple  Look  Ahead  Algorithm 

A simple look ahead algorithm calculates look ahead sets
directly from the grammar. In very general terms, a look
ahead string of length k for a read transition, X, is any
string, Xv (|Xv| = k), that might appear in a sentential form.

For a reduce transition corresponding to the production,
uX -> v, the look ahead set contains strings that might follow
X in a sentential form. Our development will follow the
description of De Remer's SLR algorithm in (Anderson et al 1973).
The algorithm will be complicated by productions with empty
right sides.

Let G = (N,T,P,S) be any TOG. It will be convenient
to define the sets M, M_L, and M_R by

        M   = {X | X ∈ N and X =>* ε using context free productions
               (of the form A -> u) only},

        M_L = {X | X ∈ (N ∪ T) and there is a series of productions

                       X u_1       -> X_1 v_1
                       X_1 u_2     -> X_2 v_2
                         ...
                       X_m u_(m+1) -> ε

               and, for some i, |u_i| ≥ 1}, and

        M_R = {X | X ∈ (N ∪ T) and there is a series of productions

                       u_1 X       -> v_1 X_1
                       u_2 X_1     -> v_2 X_2
                         ...
                       u_(m+1) X_m -> ε

               and, for some i, |u_i| ≥ 1}.

Let

        E = M*QM* ∪ M*

where

        Q = {XyZ | y ∈ (N ∪ T)* and either X ∈ M_L and Z ∈ M_R, or
             X ∈ M_L and Z ∈ M, or X ∈ M and Z ∈ M_R}.


If y = Y_1...Y_n ∈ E and Y_i ∈ M, for all i (1 ≤ i ≤ n), then
y =>* ε. Furthermore, there is an algorithm to find all
members of M (Anderson et al 1973). If y ∈ M*QM*,
then it is possible (but not certain) that y =>* ε. If G
is a CFG, then M_R = M_L = ∅ and E = M*. For all strings
y ∈ E, we know that y =>* ε. Since we can easily compute M, M_L
and M_R, we can determine whether or not a given string is in E.
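
The set M can be found by the usual fixpoint iteration over the context free productions. The sketch below is one standard way to do this and is not claimed to be the algorithm of (Anderson et al 1973); the representation of productions as pairs of symbol tuples and the name nullable_nonterminals are assumptions.

```python
def nullable_nonterminals(productions, nonterminals):
    """Compute M: nonterminals deriving the empty string using only
    context free productions (left side a single nonterminal).

    productions: list of (left, right) pairs of symbol tuples.
    """
    # Productions with longer left sides are ignored when computing M.
    context_free = [
        (left[0], right)
        for left, right in productions
        if len(left) == 1 and left[0] in nonterminals
    ]
    m = set()
    changed = True
    while changed:
        changed = False
        for head, right in context_free:
            # head is nullable once every right side symbol is nullable;
            # an empty right side satisfies this immediately.
            if head not in m and all(sym in m for sym in right):
                m.add(head)
                changed = True
    return m
```

A nonterminal that can be erased only through a production with a longer left side, such as aD -> ε, is deliberately left out of M; in the thesis's scheme such symbols are candidates for M_L or M_R instead.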


With  the  definition  of  E,  we  define  two  functions  HEAD 
and  TAIL  by 

HEAD(X)  =  {Y  I  Y  =  X  or  Xu  ^  zVv,  zeE  and  YeHEAD(V)},  and 
TAIL(X)  =  {Y  I  Y  =  X  or  uX  ^  vVz ,  zeE  and  YcTAILCV)}. 

HEAD  and  TAIL  correspond  to  the  functions,  and  Za6t., 

in (Anderson et al 1973). If Y ∈ HEAD(X), then it is possible
that Xv =>* Yv'. Similarly, if Y ∈ TAIL(X), then it is possible
that vX =>* v'Y. Finally, for any X ∈ N ∪ T, we define the set
F(X,k) as

F(X,k) = {Yw | Y ∈ F(X,1) and w ∈ F(X,k-1)}, and

F(X,1) = {Y | u -> v_1 X_1 v_2 Y_1 v_3, v_2 ∈ E, Y ∈ TAIL(Y_1),
and X ∈ HEAD(X_1)}.

If S =>* uXYv, then Y ∈ F(X,1). Unless G is a CFG, the converse
is  not  necessarily  true.  F  corresponds  to  the  function, 
in  (Anderson  et  al  1973). 


The look ahead set, of length k strings, for the i-th
transition from state q is L_k(q,t_i) where

L_k(q,t_i) = {XF(X,k-1) | t_i refers to the read transition,
               X, from q, where X ≠ e},

           = {x | t_i refers to a transition, e, to state
               q', q' has the single reduce transition,
               Ap, and the p-th production is Yv -> eAp,
               |x| = k, and vx ∈ F(Y,k+|v|)},

           = {x | t_i refers to a reduce transition, Ap,
               the p-th production is Yv -> zAp, |x| = k,
               and vx ∈ F(Y,k+|v|)}.

In this case, the inadequate state, q, is resolved if and only
if, for all i ≠ j, L_k(q,t_i) ∩ L_k(q,t_j) = ∅.

If G is a CFG, this algorithm is the same as De Remer's
SLR(k) algorithm (De Remer 1969) except that it includes
nonterminals  in  the  look  ahead  set.  Because,  in  general,  the 
definition  of  a  canonical  derivation  does  not  imply  that  the 
absolutely  rightmost  nonterminal  is  expanded  at  each  step  in  a 
derivation,  it  is  possible,  during  the  parse  of  a  string  of 
terminals, that a look ahead string includes some nonterminals.
In  an  LR(k)  parse  of  a  string  of  terminals,  look  ahead  strings 
can  only  contain  terminal  symbols. 

Local  Look  Ahead 

Informally,  a  local  look  ahead  algorithm  involves 
simulating  M(G,R)  to  determine  what  inputs  would  be  accepted  if 
a  given  transition  is  taken.  The  algorithm  is  applied  to  each 
transition  of  an  inadequate  state  to  find  its  look  ahead  set. 

For  this  algorithm,  we  require  a  formal  definition  for  a 
finite  state  machine  (FSM)  and  for  a  pushdown  transducer  (PDT) . 
An FSM, M, is defined to be (K,V,d,q_0,F) where K is a finite
non-empty set of states, V is the input alphabet, d is the next
state mapping from K x (V ∪ {e}) into K, q_0 is the start state,
and F is the set of final states in K.

A  PDT  (Aho  and  Ullman  1972a)  is  a  pushdown  automaton 
(X,^;fS'.  Therefore,  R(q,t,S)  may  contain  some  extraneous
strings  that  would  be  excluded  by  a  direct  simulation  of  the 
2SM. 


Under  appropriate  conditions,  this  definition  can  be 
used  as  an  algorithm.  The  next  lemma  illustrates  this  fact. 

Lemma 3.4


Let q be any state in M(G,R) and let
S(q) = {xq | xq is a state string of M(G,R)}. The recursive
definition of R(q,t,S(q)), for some transition t from q, will
involve only a finite number of different sets, R(q_i',t_i,S_i).

Proof 


We shall show that if R(q,t,S(q)) invokes the computation
of R(q_i',t_i,S_i) then S_i must be one of a finite number of
sets. Since there are only a finite number of states and
transitions, the evaluation of R(q,t,S(q)) must be finite.

Strings in S(q) possess an important property. Let
x(X,q_i)y(Y,q_j)zq be any string in S(q). If q_i = q_j, then,
for all y' such that

    x'q_i Xz' |-* x'(X,q_i)y'q_j

using read instructions only, x(X,q_i)y'(Y,q_j)zq is in S(q).

If q_i ≠ q_j, then there is a sequence of distinct states,
q_1',...,q_n' (q_1' = q_i and q_n' = q_j) and, for all
(X,q_i)z_1...z_(n-1) such that

    x'q_1'Xy_1 |-* x'(X,q_i)z_1 q_2'
        ...
    x"q_(n-1)'y_(n-1) |-* x"z_(n-1)q_n'

using read instructions only, x(X,q_i)z_1...z_(n-1)(Y,q_j)zq ∈ S(q).
We shall refer to this as the sequence property.

This property will also hold for any set, S_i, in
R(q_i',t_i,S_i). Since the number of states, pairs of states,
and sequences of distinct states is finite, S_i must be one
of a finite number of sets. To show that all S_i have the
required characteristic, we will prove that the definition of
R preserves the sequence property.

Assume S_i has the sequence property.

Case (i) - t_i refers to the reduce transition, Ap.

R(q_i',t_i,S_i) ⊆ {(Ap,q_i')w | w ∈ R(q,t,S) where the p-th
        production is Xu -> Y_1...Y_nAp or
        Xu -> eAp,
        S = {x(X,q")q | x(Y_1,q")...(Y_n,q_n)q_i' ∈ S_i
            or x(e,q")q_i' ∈ S_i, respectively, and
            q" has an X transition to q},
        and t is a transition from q}.

Consider any string, x(Z,q_i)y(Y,q_j)zq in S. If z = e,
then (Y,q_j) = (X,q") and x(Z,q_i)y(Y_1,q")...(Y_n,q_n)q_i'
(or x(Z,q_i)y(e,q")q_i') is in S_i. Since S_i has the sequence
property, we know that

(a) if q_i = q", then, for all y' such that

        x'q_i Zz' |-* x'(Z,q_i)y'q"

    using read instructions only,
    x(Z,q_i)y'(Y_1,q")...(Y_n,q_n)q_i' (or x(Z,q_i)y'(e,q")q_i')
    is in S_i. Thus, x(Z,q_i)y'(X,q")q is in S; and

(b) if q_i ≠ q", then there is a sequence of distinct
    states q_i = q_1",...,q_n" = q", and, for all
    Zz_1...z_(n-1) such that

        x'q_1"Zy_1 |-* x'(Z,q_1")z_1 q_2"
            ...
        x"q_(n-1)"y_(n-1) |-* x"z_(n-1)q_n"

    using read instructions only,
    x(Z,q_i)z_1...z_(n-1)(Y_1,q")...(Y_n,q_n)q_i' (or
    x(Z,q_i)z_1...z_(n-1)(e,q")q_i') is in S_i. Thus
    x(Z,q_i)z_1...z_(n-1)(X,q")q is in S.

In either case, the sequence property is satisfied.

If z = z'(X,q"), then x(Z,q_i)y(Y,q_j)z'(Y_1,q")...(Y_n,q_n)q_i'
(or x(Z,q_i)y(Y,q_j)z'(e,q")q_i') is in S_i. Since S_i has the
sequence property, an argument similar to the one above will
show that the sequence property is satisfied by
x(Z,q_i)y(Y,q_j)zq. Thus the sequence property is preserved in
this case of the right context function.
Case (ii) - t_i refers to a read transition, X, where X may be
the empty string.

R(q_i',t_i,S_i) ⊆ {(X,q_i')zw | w ∈ R(q_k,t',S"), q_k has a reduce
        transition, Ap (t' refers to this
        transition), z ∈ S' where
        S' = {y | x(X,q_i')yq_k is a state string,
             for some x}, and
        S" = {x(X,q_i')yq_k | xq_i' ∈ S_i and y ∈ S'}}.

Since S' includes all strings that access q_k from q_i'
and S_i has the sequence property, it is easy to show that S"
also has this property. Consider a string x(Z,q_i)y(Y,q_j)zq_k
in S". Either x(Z,q_i)y(Y,q_j)z'q_i' is in S_i, or
x(Z,q_i)y'q_i' is in S_i and y"(Y,q_j)z ∈ S' (y'y" = y), or
x'q_i' is in S_i and x"(Z,q_i)y(Y,q_j)z ∈ S' (x'x" = x). In each of
these cases, we can conclude that S" must have the sequence
property.

Now, since the number of pairs of states is finite and
the number of sequences of distinct states q_1',...,q_n' is
finite, there can only be a finite number of sets, S_i. Thus,
at some point in the calculation of R(q,t,S(q)), no new sets
R(q_i',t_i,S_i) will have to be computed. □

We can construct an FSM that recognizes R(q,t,S(q)).
Informally, the states are labelled with the various function
names. In case (i) of the above proof, state R(q_i',t_i,S_i) has
a transition (Ap,q_i') to state R(q,t,S). In case (ii), note
that S' is a regular set. State R(q_i',t_i,S_i) has a transition
(X,q_i') to the start state of an FSM recognizing S'. The final
states of this FSM have transitions on the empty string to
state R(q_k,t',S").

Thus, R(q,t,S(q)) is a regular set. Let us assume that
the FSM has been transformed into a deterministic version (this
can always be done (Hopcroft and Ullman 1969)) and call this
new FSM, M.
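The transformation to a deterministic FSM mentioned above is the standard subset construction. The sketch below is a generic illustration of that construction under assumed conventions: the transition table maps (state, symbol) to a set of states, None stands for the empty string e, and all function names are hypothetical. It is not specific to the FSM for R(q,t,S(q)).

```python
def epsilon_closure(states, delta):
    """All states reachable from `states` using only e-moves (symbol None)."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def determinize(start, finals, delta, alphabet):
    """Subset construction: returns (start_set, final_sets, dfa_delta)."""
    start_set = epsilon_closure({start}, delta)
    dfa_delta = {}
    seen = {start_set}
    work = [start_set]
    while work:
        current = work.pop()
        for a in alphabet:
            moved = set()
            for s in current:
                moved |= set(delta.get((s, a), ()))
            if not moved:
                continue  # no transition: the deterministic FSM rejects
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, a)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    final_sets = {s for s in seen if s & set(finals)}
    return start_set, final_sets, dfa_delta


def accepts(dfa, word):
    """Run the deterministic FSM on a sequence of input symbols."""
    state, final_sets, dfa_delta = dfa
    for a in word:
        state = dfa_delta.get((state, a))
        if state is None:
            return False
    return state in final_sets


# A small FSM with an e-move that recognizes (ab)*.
nfa_delta = {(0, "a"): {1}, (1, "b"): {2}, (2, None): {0}}
dfa = determinize(0, {0}, nfa_delta, ["a", "b"])
print(accepts(dfa, "abab"), accepts(dfa, "aba"))
```

Each state of the deterministic machine is a set of states of the original FSM, so the number of states remains finite, which is all the argument above requires.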


To translate a string in R(q,t,S(q)) to the look ahead
string in M(G,R), consider the sample string
(X_1,q_1)(Ap_1,q_1')(X_2,q_2')(X_3,q_3'). X_1 is the first symbol in the
translated look ahead string. If the p_1-th production is
YX_2X_3v -> yAp_1, then we push X_2X_3v into a stack. These symbols
must be matched by subsequent symbols in the original look
ahead string. Thus, when we read (X_2,q_2')(X_3,q_3'), the
symbols X_2 and X_3 are removed from the stack.

This example shows that the translation can be performed
by a PDT. A formal definition of this transducer is given
below. Let M = (K,V,d,s_0,F) be the FSM recognizing R(q,t,S(q)).
The PDT will be

P = (K,V,V',V',d',s_0,Z_0,F)

where V' is the union of the vocabulary of the original grammar
G, the goal post ⊥, and the symbol Z_0, and d' is defined by

(i) if d(s,(X,q)) = s', then
        d'(s,(X,q),X) = {(s',e,e)} and
        d'(s,(X,q),Z_0) = {(s',Z_0,X)};

(ii) if d(s,(Ap,q)) = s', then, for all Z ∈ V',
        d'(s,(Ap,q),Z) = {(s',xZ,e)}
     where the p-th production in G is Xx -> y;

(iii) if d'(s,X,Y) is not defined by one of the above cases
      then the PDT, P, has found an error.
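The stack discipline of rules (i) to (iii) can be illustrated with a small sketch. This is a simplification under stated assumptions: the finite control (the states s, s') is omitted, input tokens are given as ("read", X) or ("reduce", p) pairs instead of (X,q) and (Ap,q), and `left_sides` is a hypothetical table mapping a production number p to the left side Xx of that production.

```python
Z0 = "Z0"  # bottom-of-stack marker (the symbol Z_0 in the text)


def translate(tokens, left_sides):
    """Translate a token sequence into a look ahead string.

    tokens:     list of ("read", X) or ("reduce", p) pairs
    left_sides: dict from production number p to the list of symbols
                forming the left side Xx of production p
    """
    stack = [Z0]          # the end of the list is the top of the stack
    output = []
    for kind, value in tokens:
        if kind == "read":
            top = stack[-1]
            if top == value:
                stack.pop()             # rule (i): matched, no output
            elif top == Z0:
                output.append(value)    # rule (i): emit X, keep Z_0
            else:
                raise ValueError("error: %r does not match %r" % (value, top))
        elif kind == "reduce":
            x = left_sides[value][1:]   # left side Xx without its first symbol
            stack.extend(reversed(x))   # rule (ii): push x, first symbol on top
        else:
            raise ValueError("unknown token kind %r" % (kind,))
    return output


sample = [("read", "X1"), ("reduce", 1), ("read", "X2"),
          ("read", "X3"), ("read", "X4")]
print(translate(sample, {1: ["Y", "X2", "X3"]}))
```

As in the sample string discussed above, the symbols pushed by the reduction are consumed by the following reads, so only symbols read while Z_0 is on top of the stack appear in the translated look ahead string.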


From P, we can construct a CFG, G', for the set of
output strings (Aho and Ullman 1972a). The set of length k
look ahead strings for the t-th transition from state q in
M(G,R) is

L_k(q,t) = {z | either |z| = k and S =>* zw in G', or
            |z| < k and S =>* z in G'}.

The 2SM, M(G,R), is based on the set R, RCS(G_A) ⊆ R ⊆ CS(G_AC). If
G_A is the k'-augmented grammar constructed from G and k' < k,
then some strings in L_k(q,t) can be less than k symbols long.

As usual, the inadequacy of a state, q, is resolved if and only
if for any two transitions, t_i and t_j, and any two strings,
x_1 ∈ L_k(q,t_i) and x_2 ∈ L_k(q,t_j), x_1 is not a prefix of x_2
(x_1 ∉ PREFIX(x_2)).
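The resolution test just stated is easy to express directly. The following is a minimal sketch, assuming that each look ahead set is given as a Python set of tuples of symbols, one set per transition of the inadequate state; the function name and the use of "$" for the goal post are hypothetical.

```python
def is_resolved(lookahead_sets):
    """True if no string in one transition's look ahead set is a prefix
    of a string in another transition's look ahead set."""
    for i, first in enumerate(lookahead_sets):
        for j, second in enumerate(lookahead_sets):
            if i == j:
                continue
            for x1 in first:
                for x2 in second:
                    if x2[:len(x1)] == x1:   # x1 is a prefix of x2
                        return False
    return True


print(is_resolved([{("b",)}, {("c",), ("$",)}]))
print(is_resolved([{("b",)}, {("b", "c")}]))
```

The prefix test, rather than simple disjointness, is needed because some look ahead strings may be shorter than k symbols.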

As it stands, this look ahead scheme is not very practical.
It has been described for reasons of completeness and because
more efficient methods based on this algorithm may be possible.

A direct simulation of M(G,R), rather than the more
roundabout methods we have described, might be more practical.
This approach would represent an extension of De Remer's
LALR(k) algorithm (De Remer 1969). As an example of how this
technique could be applied, consider the grammar

G = ({S_A,S,B,C},{a,b,c,⊥},P,S_A) with productions

    S_A -> S⊥A0
    S -> aSBCA1
    S -> aBCA2
    CB -> BCA3
    aB -> abA4
    bB -> bbA5
    bC -> bcA6
    cC -> ccA7.

A  2SM  that  accepts  the  language,  L(G),  is  shown  on  the  next  page. 

States 7, 11, 12, and 15 are inadequate. In state 11, we
can either read b or perform a reduction for production A5.

A  look  ahead  set  for  the  read  transition  is  {b}.  We  will  try 
to  obtain  a  look  ahead  set  for  the  reduce  transition  by 
simulating  the  2SM. 

If the 2SM is in state 11, the L-STACK must contain a
string in the set {(a,1)}{(a,3)}*{(b,3)}{(b,7)}{(b,11)}*. If
the reduction is performed, the new state of the 2SM will be
3, 7, or 11 and bB will have been pushed into the R-STACK. In
any of these states, the 2SM will proceed to read the b
(otherwise, the parse cannot be canonical). At this point,
one of three cases holds:

(i) the parser is in state 7 and the L-STACK contains a
string in {(a,1)}{(a,3)}*{(b,3)};

(ii) the parser is in state 11 and the L-STACK contains
a string in the set {(a,1)}{(a,3)}*{(b,3)}{(b,7)};

(iii) the parser is in state 11 and the L-STACK contains
a string in the set

{(a,1)}{(a,3)}*{(b,3)}{(b,7)}{(b,11)}+.

In  case  (iii),  since  the  top  symbol  in  the  R-STACK  is  B,  the 
reduce  transition,  A5,  must  be  taken.  However,  having 
performed  the  reduction  and  having  read  the  b  at  the  top  of 
the  R-STACK,  the  2SM  could  be  in  the  situation  described  in 
(iii),  once  again.  Clearly,  the  simulation  of  the  2SM  can  loop 
forever  in  this  way,  each  iteration  resulting  in  another  B  at 
the  top  of  the  R-STACK.  In  the  LALR(k)  algorithm,  a  loop  of  this 
type  can  occur,  but,  of  course,  the  contents  of  the  R-STACK 
will  remain  unchanged  since  only  a  single  symbol  is  pushed  into 
the  R-STACK  by  a  reduction,  and  this  symbol  is  read  immediately. 
Informally,  the  look  ahead  strings  we  would  discover  will  be 
the  same,  no  matter  how  many  times  our  simulation  goes  around 
the  loop.  Thus,  we  only  need  to  simulate  one  iteration.  In 
the  general  case,  it  is  not  clear  that  one  iteration  is 
sufficient  to  discover  all  look  ahead  strings.  In  our  example, 
we  may  note  that  the  loop  can  result  in  an  arbitrary  number  of 
B's  in  the  R-STACK;  nevertheless,  all  of  these  symbols  will  be 
read eventually by state 10 (which can read any number of B's).
Thus, for this 2SM, one iteration of the loop is sufficient.

We conjecture that there is a local look ahead scheme
designed along the lines of the above informal method. To be
practical (that is, the method must terminate eventually), some
means of tackling the loops we have described must be developed.
A solution to this problem would be of practical importance.

In Chapter 6, we propose a parser generator that finds the
required look ahead sets as part of the parser construction
process. The parsers that are produced will also have the
added advantage of being able to detect errors very early in the
parse.


D2SM’s  as  Canonical  Parsers 


Assuming  we  can  resolve  all  inadequate  states  using  some 
valid  look  ahead  algorithm,  we  will  now  have  a  deterministic 
parser  based  on  the  original  grammar.  An  input  that  is 
accepted  will  be  parsed  in  a  unique  way.  From  the  corollary 
to  Lemma  3.2,  we  may  state  the  following  theorem. 

Theorem 3.5

Let G = (N,T,P,S) be any TOG, let G_A = (N_A,T_A,P_A,S_A) be
the k-augmented grammar constructed from G, and let G_AC be
the containing CFG derived from G_A. If R is a regular set such
that RCS(G_A) ⊆ R ⊆ CS(G_AC), then M(G,R) is the 2SM constructed from
R. Let M be a D2SM constructed from M(G,R) using any valid
look ahead algorithm. M accepts the input, w, if and only if
the parse of w simulates a canonical derivation, in reverse, of
w from S_A. In other words, M is a canonical parser.

Proof 


This  theorem  follows  immediately  from  Theorem  3.3  and 
the  corollary  to  Lemma  3.2.  □ 

Corollary

Let G be any TOG and let G_A be the k-augmented grammar
derived from G. Finally, let M be any D2SM constructed as in
Theorem 3.5. G must be unambiguous.

Proof

The language accepted by M is L(G_A) and every
input that is accepted is parsed in a unique way. From
Theorem 3.5, this parse is a canonical derivation in
reverse. Thus, by definition, G is unambiguous. □

The  above  theorem  is  of  particular  importance  since  it 
guarantees  that  reductions  are  applied  in  a  predictable  order. 
Code  synthesis  is  often  linked  to  both  the  reductions  and  the 
order  in  which  they  are  applied  by  the  parser  (see,  for  example, 
(Gorrie  1971)). 
We conclude this chapter with a discussion of the time
requirements  of  D2SM’s. 

Time Efficiency of D2SM's


Haist (Haist 1971) has obtained an upper bound on the
time required by an LR(1) parser to accept an input. Informally,
the basic operations of the parser are assumed to take one
time unit. Thus the number of basic operations that a parser
performs will be equal to the amount of time used. A simple
extension of Haist's results will yield a bound for LR(k)
parsers. For an input of length n whose parse requires r reduce
operations, the bound is composed of

    n           reads
    r           reductions
    n+(2+k)r    state changes
    n+r         stack pops
    n+r         stack pushes
    kr          look aheads
    r           examinations of the stack
    n+(2+k)r    state table look ups.

The bound is 5n+(8+3k)r. LR(k) parsers are said to require
time linear in the length of the input. The above time bound
shows this linearity.


A similar calculation is possible for general D2SM's.
If a string is derived in r steps, and the length of the left
side of the production used in the i-th step is l(i), then we
define

    s = Σ (l(i) - 1),  summed over i = 1,...,r.

For a D2SM, the time bound is composed of

    n+s           reads
    r             reductions
    n+(2+k)r+s    state changes
    n+r+2s        stack pushes
    n+r+2s        stack pops
    kr            look aheads
    r             examinations of the stack
    n+(2+k)r+s    state table look ups.

The time bound is then

    5n + (8+3k)r + 7s.
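As a check on the arithmetic, the two bounds can be computed by summing the individual operation counts. The sketch below is illustrative only; the function names and the representation of the derivation as a list of left side lengths l(1),...,l(r) are assumptions made for the example.

```python
def lr_time_bound(n, r, k):
    """Sum of the operation counts for an LR(k) parse."""
    counts = {
        "reads": n,
        "reductions": r,
        "state changes": n + (2 + k) * r,
        "stack pops": n + r,
        "stack pushes": n + r,
        "look aheads": k * r,
        "stack examinations": r,
        "state table look ups": n + (2 + k) * r,
    }
    return sum(counts.values())


def d2sm_time_bound(n, k, left_lengths):
    """Sum of the operation counts for a D2SM parse.

    left_lengths: the values l(1),...,l(r), one per derivation step.
    """
    r = len(left_lengths)
    s = sum(length - 1 for length in left_lengths)
    counts = {
        "reads": n + s,
        "reductions": r,
        "state changes": n + (2 + k) * r + s,
        "stack pushes": n + r + 2 * s,
        "stack pops": n + r + 2 * s,
        "look aheads": k * r,
        "stack examinations": r,
        "state table look ups": n + (2 + k) * r + s,
    }
    return sum(counts.values())


print(lr_time_bound(10, 4, 1))
print(d2sm_time_bound(10, 1, [1, 2, 1, 3]))
```

When every left side has length one (a context free derivation), s is zero and the D2SM bound reduces to the LR(k) bound.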

We  cannot  make  the  claim  that  a  parse  requires  linear 
time.  For  LR(k)  parsers,  r  is  linearly  proportional  to  n.  In 
the  general  case,  this  is  not  so.  However,  we  can  make  a 
qualitative  claim.  Since  s  is  a  function  of  r  and  l(i),  the 
amount  of  time  required  beyond  the  linear  bound  is  proportional 
to  the  number  of  non-context  free  productions  involved  in  the 
derivation  of  the  input. 

This  conclusion  is  very  important  because  a  parse  that 
involves  no  non-context  free  productions  will  only  require 
linear  time.  When  additional  time  is  required,  it  is  proportional 
to  the  number  of  non-context  free  productions  used  in  the 
derivation  of  the  input.  While  the  D2SM  model  is  capable  of 
parsing non-context free languages, a D2SM based on a CFG will
still parse any input in linear time. There is no loss in
time efficiency as a result of the increased power of the two
stack machine.
CHAPTER 4


HALTING  TWO  STACK  MACHINES 


A  practical  parser  must  halt  after  a  finite  number  of 
parse  steps,  independent  of  the  input  string.  We  shall  see  that 
this  is  not  a  general  property  of  D2SM's.  Thus,  those  D2SM’s 
that do halt on all inputs (halting D2SM's) form the largest
class  of  2SM’s  that  are  of  interest  in  the  context  of 
compilation.  This  chapter  examines  these  two  stack  machines 
and  the  halting  problem. 

In Chapter 1, we indicated that our main interest was in
another  class  of  D2SM’s,  those  that  detect  errors  as  early  as 
possible  in  a  parse.  These  D2SM’s,  in  fact,  form  a  proper 
subset  of  the  class  of  halting  D2SM’s.  As  a  result,  we  do  not 
intend  to  study  the  general  halting  problem  in  detail.  However, 
our  discussion  here  should  provide  a  framework  for  the  results 
and  arguments  of  the  next  chapter. 

The  Halting  Problem 

In general, we say a D2SM, M, has a halting problem if
there  is  some  input  for  which  M  does  not  halt.  In  the  last 
chapter,  we  described  a  parser  generator  and  two  look  ahead 
algorithms.  For  some  TOG’s,  these  algorithms  will  construct 
a  D2SM.  Determinism  in  a  2SM  is  not,  however,  a  guarantee 
that  it  will  always  halt.  Consider  the  grammar 
G = ({S_A,S,C,D},{a,d,b,⊥},P,S_A) with productions

    S_A -> S⊥A0
    S -> CdA1
    Ca -> DbA2
    Db -> CaA3
    C -> cA4.

The 2SM, M(G,CS(G_AC)), can be represented by the following
state diagram:


M(G,CS(G_AC)) is deterministic; however, the input ca puts the
parser into an infinite loop.
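The loop can be made visible with a much simplified model. The sketch below is not a simulation of the 2SM itself: it ignores the state diagram, the two stacks and the goal post, and simply applies the productions of G in reverse (right side replaced by left side) to the leftmost matching position, which is an assumption made purely for illustration. The names REDUCTIONS and simulate are hypothetical.

```python
# Each pair is (right side, left side): applying it undoes one production.
REDUCTIONS = [
    ("Cd", "S"),    # S  -> Cd
    ("Db", "Ca"),   # Ca -> Db
    ("Ca", "Db"),   # Db -> Ca
    ("c", "C"),     # C  -> c
]


def simulate(form):
    """Reduce `form` repeatedly; report a loop if a form is repeated."""
    seen = set()
    while True:
        if form in seen:
            return ("loop", form)
        seen.add(form)
        best = None
        for right, left in REDUCTIONS:
            pos = form.find(right)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, right, left)
        if best is None:
            return ("halt", form)      # no reduction applies
        pos, right, left = best
        form = form[:pos] + left + form[pos + len(right):]


print(simulate("ca"))   # c -> C, then Ca -> Db -> Ca -> ... repeats
print(simulate("cd"))   # c -> C, then Cd -> S and nothing more applies
```

In this model the input ca is rewritten to Ca, then alternately to Db and back to Ca, which mirrors the way the productions Ca -> Db and Db -> Ca undo each other.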


Informally, the problem is caused by the fact that
CS(G_AC) ⊃ CS(G). The characteristic strings DbA2 and CaA3 in
CS(G_AC) - CS(G) are responsible for the loop. If the finite
control had been constructed from CS(G), there would be no
halting problem. This is the approach to parser construction
that is taken in Chapter 5. Nevertheless, extra characteristic
strings, such as DbA2 and CaA3, do not necessarily result in a
halting problem. For example, if the production Db -> CaA3 is
removed from the grammar, the halting problem is eliminated.
Halting D2SM's and Deterministic Parsable Languages


A D2SM that halts on all inputs is called a halting D2SM.
By definition,

(i) a halting D2SM will always halt and accept or reject
the input;

(ii) halting D2SM's are deterministic;

(iii) these parsers perform a single left to right scan of
the input.

However, halting D2SM's do not necessarily detect an error as
early as possible in a parse. Thus, the importance of halting
D2SM's and the equivalent class of grammars is that they embody
some of the parser characteristics we discussed in Chapter 3.
Parsers that embody all of these characteristics are discussed
in Chapter 5.

If G is a TOG, G_A is the k-augmented grammar derived from
G, and G_AC is the containing CFG constructed from G_A, then
M(G,R) is the 2SM constructed from some regular set R, where
RCS(G_A) ⊆ R ⊆ CS(G_AC). If M' is a halting D2SM obtained from
M(G,R) using only valid look ahead sets (valid look ahead sets
and the required construction are discussed in Chapter 3), then
G is deterministic parsable (DP). If a TOL, L, equals L(G)
for some deterministic parsable grammar, G, then L is a
deterministic parsable language.
Characterization of the Halting Problem


Let M be any D2SM. If M has a halting problem, there is
a state q and a string y such that for any state string xq and
any string z


for all i > 0. The strings, x and z, will never be used.

In general, the length of the parse string will tend to
become larger and larger. There is a special kind of halting
problem, which we have already called a loop, where


and i ≠ j. In this case, M is transforming a parse string into
itself over and over again. A D2SM constructed from a CSG can
only have a halting problem of this type, since, for such a
D2SM, the parse string can never increase in length during a
parse.


From our observations above, we can obtain two
independent necessary and sufficient conditions for a D2SM, M,
to be a halting parser. These conditions are essentially
restatements of what happens when M fails to halt.

Condition I

Let yq -> q'u be any reduce instruction in M. For any
state string xyq, for any Z ∈ V_R, and for any w ∈ V_R*, either

(i) xyqZw |- xq'uZw |-* x'q"w, or
(ii) an error is found in uZ.


If M has a halting problem, then there must be a reduce
instruction that violates this condition. On the other hand,
if the condition does not hold, a halting problem exists by
definition.

Intuitively, the D2SM's of the next chapter will always
halt because they are designed so that (i) is true for all
state strings xyq.

Condition II

Let yq -> q'u be any reduce instruction in M. For any state
string xyq and any string z ∈ V_R*, either

(i) xyqz |- xq'uz |-* xvq"z' |- x'q'''u'z' where
    x' ≠ x,

(ii) (i) does not hold and an error is found in uz,

(iii) x = e, q' = q_1, and yqz |- q_1uz |-* ACCEPT, or

(iv) x = e, q' = q_1, and an error is found in uz.

Open Problem

In Chapter 6, we describe a parser generator that
constructs halting D2SM's that detect an error as early as
possible in a parse. This algorithm will not be applicable to
all DP grammars. In some circumstances, a more general
algorithm (one which constructs halting D2SM's that do not
necessarily detect errors as early as possible) could be useful.
For example, consider the bootstrap development of a compiler
(McKeeman et al 1970). Parsers in intermediate versions of
the compiler, which are only used a few times, need not have
the good error detecting properties described above, although
it is highly desirable that the final version of the parser
does.

Undoubtedly, from a theoretical point of view, the
development of this more general parser generator would also
provide a good deal of insight into the halting problem itself.

We have noted that a non-halting D2SM may result from
the parser generator of Chapter 3 because of strings in the set
CS(G_AC) - CS(G). In very general terms, one approach to the
problem of constructing halting D2SM's is to modify the previous
parser generator so that characteristic strings that might cause
a halting problem are detected and excluded. For example, the
LR(0) constructor that is used to compute CS(G_AC) could be
altered so that some items, that otherwise would have been
included, are not added to state sets. Intuitively, this change
might involve a check that, if the item (Au -> AvAp,e) is to be
added to a state set Q(q), then for some or all state strings
xq, the left side of production p, Au, will be found to be error
free.

Error  Detection 


In general, halting D2SM's lack one desirable characteristic;
they do not necessarily detect an error as soon as
possible in the left to right scan of the input. Consider the
grammar G = ({S_A,S,B,C},{a,b,c,⊥},P,S_A) where the productions
are

    S_A -> S⊥A0
    S -> aSBCA1
    S -> aBCA2
    CB -> BCA3
    aB -> abA4
    bB -> bbA5
    bC -> bcA6
    cC -> ccA7.

The finite state control of a halting D2SM based on CS(G_AC) is
shown on the next page. Look ahead transitions are shown in
curly brackets ({...}). L(G) = {a^n b^n c^n ⊥ | n >= 1}. The string
a^(n+1) b^n c^n ⊥ is not in the language generated by G. However,
the parser will reduce the input a^(n+1) b^n c^n ⊥ to the form aS⊥
before detecting an error in state 5. This error might have
been detected as soon as the parser had read the first c. At
that point, it can be established that the number of a's and
b's is not equal. Once the error is actually discovered, the
b's and c's have been reduced to S (along with some a's),
leaving very little information for any error recovery
mechanism to work with.
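The form of the sentences generated by G can be checked for short strings by exhaustively applying the productions. The sketch below is only an illustration under stated assumptions: it omits the augmented production, the goal post and the production markers, uses single characters for symbols (upper case for nonterminals), and relies on the fact that none of these productions shortens a sentential form, so the search can be bounded by length. The function name is hypothetical.

```python
from collections import deque

# Productions of G without the augmented rule and without markers.
PRODUCTIONS = [
    ("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc"),
]


def sentences(max_len):
    """All terminal strings of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    found = set()
    while queue:
        form = queue.popleft()
        if form.islower():          # only terminals remain
            found.add(form)
            continue
        for left, right in PRODUCTIONS:
            start = form.find(left)
            while start != -1:      # try every occurrence of the left side
                nxt = form[:start] + right + form[start + len(left):]
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                start = form.find(left, start + 1)
    return found


print(sorted(sentences(6)))
```

Sentential forms such as aabcBC, in which a c has been produced too early, cannot be rewritten further and therefore never yield a terminal string.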

As an alternative, consider the grammar
G' = ({S_A,S,A,B},{a,b,c,⊥},P',S_A) with productions

    S_A -> S⊥A0
    S -> ABScA1
    S -> AbcA2
    Ab -> abA3
    Aa -> aaA4
    Bb -> bbA5
    BA -> ABA6.

L(G') = L(G) = {a^n b^n c^n ⊥ | n >= 1}. A halting D2SM that accepts
the language L(G') is depicted in Figure 4.3. The parse of
a^(n+1) b^n c^n ⊥ ends when an error is found in state 15. All of the
a's and b's will be read but no c's are scanned. Thus the error
is found as early as possible in the left to right scan.

In  a  compiler,  the  ability  to  detect  errors  early  in  a 
parse  is  very  desirable.  In  part,  this  characteristic  is 
responsible  for  the  success  of  LR(k)  parsing  methods.  The  next 
chapter  examines  halting  D2SM’s  that  possess  this  error 
detection  property. 
CHAPTER 5


REGULAR PARSABILITY AND ERROR DETECTION

In Chapter 3, we discussed the desirable characteristics
we would want in a parser. Among these was the
property that errors should be detected as soon as they
are encountered in the left to right scan of the input.

In  particular,  we  say  that  a  parser  which  never  reads  past 
a  point  at  which  it  can  be  established  that  the  input 
contains an error has the error property. Parsers with
the  error  property  signal  an  error  as  soon  as  the  contents 
of  the  L-STACK  cannot  represent  a  prefix  of  an  acceptable 
sentential  form.  None  of  the  parsers  developed  in  previous 
chapters  (except  LR(k)  parsers)  necessarily  has  the  error 
property. 
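For a concrete language the earliest point at which an error can be established is easy to compute. The sketch below does this for the language {a^n b^n c^n ⊥ | n >= 1} used in Chapter 4. It is an illustration of the error property only, not a D2SM: the function names are hypothetical and "$" is used in place of the goal post ⊥.

```python
import re


def is_sentence_prefix(p):
    """True if p is a prefix of some string a^n b^n c^n $ with n >= 1."""
    m = re.fullmatch(r"(a*)(b*)(c*)(\$?)", p)
    if m is None:
        return False
    i, j, k = (len(m.group(g)) for g in (1, 2, 3))
    ended = bool(m.group(4))
    if j > i:
        return False                 # more b's than a's
    if k > 0 and (j != i or k > j):
        return False                 # a c requires equal a's and b's
    if ended and not (i == j == k >= 1):
        return False                 # goal post only after a full sentence
    return True


def first_error_index(w):
    """Index of the first symbol at which an error can be established,
    or None if every prefix of w can still be extended to a sentence."""
    for n in range(1, len(w) + 1):
        if not is_sentence_prefix(w[:n]):
            return n - 1
    return None


print(first_error_index("aaabbcc$"))
print(first_error_index("aabbcc$"))
```

For the input a^(n+1) b^n c^n the first failing position is the first c, which is the point a parser with the error property must not read past.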

There  are  numerous  advantages  to  be  gained  from  using 
parsers  with  the  error  property.  Among  the  benefits  are: 

(i)  because  the  actual  location  of  the  error  is  found, 
the  programmer's  task  of  finding  and  correcting  the 
error  should  be  easier;  and 

(ii)  error  recovery  should  be  easier  too.  In  general, 
more  information  should  be  available  to  use  in  the 
recovery  process  since  the  error  is  detected  early 
in  the  parse. 

The  error  property  is  one  of  the  reasons  for  the  success 
of  LR(k)  parsing  methods. 
We shall see that D2SM's with the error property will
halt on all inputs. Consequently, these devices correspond
exactly to the class of parsers that meet all the requirements
set out in Chapter 3. Moreover, the similarity between D2SM's
and LR(k) parsers suggests that much of the research related
to LR(k) parsing (table reduction techniques (Joliat 1973),
for example) will also be applicable to D2SM's. This chapter
deals with both parsers and grammars that embody the error
property.

D2SM's with the Error Property

For the following discussion, it will be convenient to define the no look ahead version of a D2SM. For any D2SM, M = (K,V_L,V_R,T,I), the no look ahead version of M is a 2SM M' = (K,V_L,V_R,T,I'). I' is obtained from I as follows:

(i) if the read-and-look-ahead instruction qXt -> (X,q)q't is in I, then I' contains the read instruction qX -> (X,q)q';

(ii) if I contains a look ahead instruction qt -> q't, then q' is a reduce state. The only instructions applicable to q' are reduce instructions. If yq' -> q"x is in I, I' will contain the reduce instruction yq -> q"x;

(iii) read, accept, and reduce instructions in I are also in I'.

In general, M' is simply a nondeterministic version of M.
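
The three rules above can be sketched in a few lines of Python. The tuple encoding of instructions used here (the tags "read", "read_la", "look", "reduce", "accept" and the order of the fields) is an assumption made purely for illustration; it is not notation taken from this thesis.

```python
def no_look_ahead_version(instructions):
    """Return I' for a set I of instructions (hypothetical tuple encoding).

    ("read", q, X, q2)        stands for  qX  -> (X,q)q2
    ("read_la", q, X, t, q2)  stands for  qXt -> (X,q)q2 t
    ("look", q, t, q2)        stands for  qt  -> q2 t
    ("reduce", y, q, q2, x)   stands for  yq  -> q2 x
    ("accept",)               stands for the accept instruction
    """
    reduces = [ins for ins in instructions if ins[0] == "reduce"]
    result = set()
    for ins in instructions:
        if ins[0] == "read_la":
            # Rule (i): keep the read, drop the look ahead string t.
            _, q, X, _t, q2 = ins
            result.add(("read", q, X, q2))
        elif ins[0] == "look":
            # Rule (ii): copy every reduce instruction of q2 back to q.
            _, q, _t, q2 = ins
            for _, y, state, target, x in reduces:
                if state == q2:
                    result.add(("reduce", y, q, target, x))
        else:
            # Rule (iii): read, accept, and reduce instructions are kept.
            result.add(ins)
    return result
```

If a state q has look ahead instructions leading to two different reduce states, the result contains two reduce instructions applicable to q, which is why M' is in general nondeterministic.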

Consider a D2SM, M = (K,V_L,V_R,T,I), with the following characteristics.

(i) Let (X,q')xq -> q'y be any reduce instruction. For any state string, u(X,q')xq, there is another string, z ∈ V_R*, such that

    u(X,q')xqz |- uq'yz |-* ACCEPT.

(ii) Let qX -> (X,q)q' be any read instruction. For any state string, uq, there is another string, z, such that

    uqXz |- u(X,q)q'z |-* ACCEPT.

(iii) Let qXt -> (X,q)q't be any read-and-look-ahead instruction. For any state string, uq, there is another string, z, such that

    uqXtz |- u(X,q)q'tz |-* ACCEPT.

(iv) Let qt -> q't be any look ahead instruction. For any state string, uq, there is another string, z, such that

    uqtz |- uq'tz |-* ACCEPT.

(v) Let M' be the no look ahead version of M. Then T(M) = T(M'). Informally, this means that look ahead is not used to affect the set of strings accepted by M.

An immediate consequence of these constraints is that for any state string, uq, there is a string, z, such that

    uqz |-* ACCEPT.

Together these restrictions imply that, until an error is found, the portion of the original input that has not been read or used in a look ahead could lead to M accepting. Furthermore, errors are detected only when reading or looking at the first symbol of the unread input. Thus M has the error property and will be called a D2SM with the error property.

A 2SM which has only accept, read, and reduce instructions and which meets conditions (i) and (ii) of a D2SM with the error property will be called an error property 2SM without look ahead. These rather special 2SM's will be useful in Chapter 7.

A D2SM with the error property has all the desirable features outlined in Chapter 3. The next theorem proves that the error property guarantees that a D2SM will halt for all inputs.


Theorem 5.1

Let M be any D2SM with the error property. M will halt on all inputs.

Proof

Assume that there is an input, w, for which M does not halt. Surely w is not an element of T(M). Since M is deterministic, if w is in T(M), then the unique parse of w is

    q_1 w |-* ACCEPT.

Thus M would halt.


Since M does not halt for all inputs, there is a particular string xytu such that

    q_1 xytu |-* P(x)qytu |-^i P(x) z_{i,1} q_i' z_{i,2} tu

for all i > 0. That is, M makes an infinite number of moves without ever using P(x) or reading tu (t is used as look ahead only; u is not used at all).

We will show that M must have an inadequate state by demonstrating that M operates nondeterministically.


Let M' be the no look ahead version of M. Clearly, M' will not halt when xytu is used as input. In fact,

    q_1 xytu |-* P(x)q'ytu |-^i P(x) z_{i,1} q_i" z_{i,2} tu      (all moves in M')

for all i > 0. M' will still have properties (i) and (ii) of a D2SM with the error property. Moreover, M' has read, accept, and reduce instructions only. Thus for any two integers i_1 and i_2, there are strings u_{i_1} and u_{i_2} such that

    q_1 xyu_{i_1} |-* P(x)q'yu_{i_1} |-^{i_1} P(x) z_{i_1,1} q_{i_1}" z_{i_1,2} u_{i_1} |-* ACCEPT

and

    q_1 xyu_{i_2} |-* P(x)q'yu_{i_2} |-^{i_2} P(x) z_{i_2,1} q_{i_2}" z_{i_2,2} u_{i_2} |-* ACCEPT,

where all moves are moves of M'.


Let k be an integer chosen so that, for any instruction qs -> q's or qXs -> (X,q)q's in M, |s| ≤ k. Since only finitely many strings have length at most k, there are two particular values of i_1 and i_2 (i_1 ≠ i_2) such that FIRST(u_{i_1},k) = FIRST(u_{i_2},k). Let us write FIRST(u_{i_1},k) = t, u_{i_1} = tu'_{i_1}, and u_{i_2} = tu'_{i_2}.

Thus we may conclude that

(i) xyu_{i_1} and xyu_{i_2} are accepted by M' and hence, by definition, are accepted by M;

(ii) the parse by M of xy in xytu'_{i_1} differs from the parse of xy in xytu'_{i_2};

(iii) M cannot look beyond the end of t during the parse of xy.

Consequently, M must have some inadequate state (this state will be encountered somewhere in the non-halting parse). M is a D2SM and can have no inadequate states, a contradiction. M must be a halting D2SM. □

In fact, we have proved more than required. We have not used constraints (iii) and (iv) on D2SM's with the error property. The corollary below follows immediately.

Corollary

Let M = (K,V_L,V_R,T,I) be any D2SM with the error property. Let M' = (K,V_L,V_R,T,I') be any D2SM obtained from M by adding new instructions qt -> q't or qXt -> (X,q)q't where qt' -> q't' ∈ I or qXt' -> (X,q)q't' ∈ I, respectively, for some t'. M' has extraneous look aheads. M' is a halting D2SM.

Proof

We proved this in the last theorem. □

The  class  of  languages  parsable  by  a  D2SM  with 
the  error  property  can  also  be  characterized  in  terms  of 
grammars  and  the  concept  of  regular  parsability. 

Regular  Parsable  Grammars 


We begin with several definitions. Let G = (N,T,P,S) be any TOG and let G_A = (N_A,T_A,P_A,S_A) be the k-augmented grammar constructed from G.

(i) characteristic strings - These were defined in Chapter 3 and we repeat the definition here.

    CS(G) = {wAp | S =>* u_1 y u_2 => u_1 v u_2, u_1 v = w, and the production used in the last step is y -> vAp}.

Clearly, CS(G_A) = {S ⊥^k A0} ∪ CS(G).

(ii) restricted characteristic strings - This set was also defined in Chapter 3.

    RCS(G) = {wAp | wAp ∈ CS(G) and S =>* u_1 y u_2 => u_1 v u_2 =>* z, u_1 v = w, z ∈ T*, and y -> vAp ∈ P}.

By definition, RCS(G) ⊆ CS(G).

(iii) k-augmented set of characteristic strings - We defined this set for context free grammars. We extend the definition, in slightly more general terms, to all TOG's. Any set S such that

    (a) if (wAp,t) ∈ S, then S_A =>* u_1 ytu_2 => u_1 vtu_2, u_1 v = w, the production used in deriving u_1 vtu_2 was y -> vAp and |t| ≤ k, and

    (b) if S_A =>* u_1 ysu_2 => u_1 vsu_2, u_1 v = w, the last production used in deriving u_1 vsu_2 was y -> vAp and |s| = k, then there is an element (wAp,t) ∈ S such that t ∈ PREFIX(s),

is a k-augmented set of characteristic strings. Such a set is denoted ACS(G,k).

(iv) k-augmented set of restricted characteristic strings - This set is defined analogously to ACS(G,k). Any set S such that

    (a) if (wAp,t) ∈ S, then S_A =>* u_1 ytu_2 => u_1 vtu_2 =>* z, z ∈ T_A*, u_1 v = w, the production used in deriving u_1 vtu_2 was y -> vAp and |t| ≤ k, and

    (b) if S_A =>* u_1 ysu_2 => u_1 vsu_2 =>* z, z ∈ T_A*, u_1 v = w, the last production used in deriving u_1 vsu_2 was y -> vAp and |s| = k, then there is an element (wAp,t) ∈ S such that t ∈ PREFIX(s),

is a k-augmented set of restricted characteristic strings. Such a set is denoted ARCS(G,k).

In either of the above sets, the string t in (wAp,t) is called a look ahead string.

In the definition of ACS(G,k) for CFG's, we required that all look ahead strings were k symbols long (except in (S ⊥^k A0,ε)). The general definition uses k as a maximum length only. The notation ACS(G,k) or ARCS(G,k) does not define a specific set but refers to any set that satisfies the above definitions.

A grammar, G, is regular parsable (RP) if and only if CS(G) is a regular set. Similarly, a grammar, G, is restricted regular parsable (RRP) if and only if RCS(G) is regular.

G is deterministic regular parsable (DRP) if and only if for some positive integer k there is a set ACS(G,k) that satisfies the following restrictions.

(i) For any two elements of ACS(G,k), (w_1 Ap_1,t_1) and (w_2 Ap_2,t_2), w_1 t_1 is not a prefix of w_2 t_2 (w_1 t_1 ∉ PREFIX(w_2 t_2)).

(ii) Let T(Ap) = {t | (wAp,t) ∈ ACS(G,k)} and let T' be any subset of T(Ap). The set S(T') = {wAp | (wAp,t) ∈ ACS(G,k) for all t ∈ T'} is regular.

Similarly, G is deterministic restricted regular parsable (DRRP) if and only if for some positive integer k, there is a set ARCS(G,k) that satisfies the following restrictions.

(i) For any two elements of ARCS(G,k), (w_1 Ap_1,t_1) and (w_2 Ap_2,t_2), w_1 t_1 ∉ PREFIX(w_2 t_2).

(ii) Let T(Ap) = {t | (wAp,t) ∈ ARCS(G,k)} and let T' be any subset of T(Ap). The set S(T') = {wAp | (wAp,t) ∈ ARCS(G,k) for all t ∈ T'} is regular.

A language L is RP, RRP, DRP, or DRRP if and only if there is a grammar G such that L = L(G) and G is RP, RRP, DRP, or DRRP respectively.
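
Restriction (i) of the DRP and DRRP definitions is a prefix condition on pairs of augmented characteristic strings. The sets involved are usually infinite, so the condition cannot be tested by enumeration, but for a finite sample it can be checked directly. The following Python sketch assumes each element (wAp,t) is encoded as a triple (w, p, t) of plain strings; this encoding and the function name are illustrative assumptions only.

```python
from itertools import permutations

def find_prefix_violation(sample):
    """Return a pair of distinct elements whose w1+t1 is a prefix of w2+t2.

    `sample` is a list of distinct triples (w, p, t) standing for (wAp, t).
    Returns None when no ordered pair in the sample violates restriction (i).
    """
    for (w1, p1, t1), (w2, p2, t2) in permutations(sample, 2):
        if (w2 + t2).startswith(w1 + t1):
            return (w1, p1, t1), (w2, p2, t2)
    return None
```

For instance, the sample [("a", "1", ""), ("ab", "2", "")] is rejected because "a" is a prefix of "ab", which corresponds to a conflict that no look ahead string resolves. A sample that passes this check proves nothing about the full set; it only fails to exhibit a violation.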

Knuth (Knuth 1965) and De Remer (De Remer 1969) have obtained results which show that

(i) all CFG's are RP. Thus all CFL's are RP;

(ii) all CFG's and CFL's are RRP. If a CFG, G, has no useless nonterminals (a nonterminal is useless if it does not derive a string of terminals), then CS(G) = RCS(G);

(iii) a CFG with no useless nonterminals is LR(k) if and only if it is DRP. Furthermore, such a grammar is LR(k) if and only if it is DRRP.

Not all CSG's are RP, let alone DRP or DRRP. For example, consider the grammar G = ({S,B,C},{a,b,c},P,S) with productions

    S  -> aSBC A1
    S  -> aBC  A2
    aB -> ab   A3
    bB -> bb   A4
    bC -> bc   A5
    cC -> cc   A6
    CB -> BC   A7.

G generates the CSL, {a^n b^n c^n | n ≥ 1}. The characteristic strings are

    CS(G) = {a*aSBCA1} ∪ {a*aBCA2}
          ∪ {a*abA3} ∪ {a^n b^m bbA4 | n ≥ m+2, m ≥ 0}
          ∪ {a^n b^m bcA5 | n ≥ m+1, m ≥ 0} ∪ {a^n b^m c^p ccA6 | n ≥ m ≥ p+2}
          ∪ {a^n B^m BCA7 | n ≥ m+1} ∪ {a^n SB^m BCA7 | n ≥ m+1}.
CS(G) is not regular. However, consider a modified version of G, G', with productions

    S  -> ABSc A1
    S  -> Abc  A2
    Ab -> ab   A3
    Aa -> aa   A4
    Bb -> bb   A5
    BA -> AB   A6.

G' and G generate the same language, but

    CS(G') = {(AB)*ABScA1} ∪ {(AB)*AbcA2}
           ∪ {(AB)*A*abA3} ∪ {(AB)*A*aaA4}
           ∪ {(AB)*A*AAbbA5} ∪ {(AB)*A*AABA6}.

This set is regular and, in fact, G' is both RP and RRP, since RCS(G') = CS(G') - {(AB)^+ A*abA3} - {(AB)^+ A*aaA4}. The sets ACS(G',0) and ARCS(G',0) satisfy the DRP and DRRP definitions respectively. Thus G' is also DRP and DRRP.

These examples show that some, but not all, CSG's are RP (RRP, DRP, DRRP). Furthermore, there is a language ({a^n b^n c^n | n ≥ 1}) that is RP, RRP, DRP, DRRP, but neither LR(k) nor context free.
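
The claim that G and G' generate the same language can be spot-checked mechanically for short strings. The Python sketch below performs a breadth-first search over sentential forms, with apply symbols omitted. It relies on the observation that no production of G or G' shortens a sentential form, so forms longer than the bound can be discarded safely. The function name and the single-character symbol encoding are assumptions made for this illustration.

```python
from collections import deque

def terminal_strings(productions, start, terminals, max_len):
    """Collect all terminal strings of length <= max_len derivable from start.

    Valid only for grammars whose productions never shorten a sentential form.
    """
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        if all(ch in terminals for ch in form):
            found.add(form)
            continue
        for lhs, rhs in productions:
            # Rewrite every occurrence of lhs in the current form.
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return found

G = [("S", "aSBC"), ("S", "aBC"), ("aB", "ab"), ("bB", "bb"),
     ("bC", "bc"), ("cC", "cc"), ("CB", "BC")]
G_PRIME = [("S", "ABSc"), ("S", "Abc"), ("Ab", "ab"), ("Aa", "aa"),
           ("Bb", "bb"), ("BA", "AB")]

strings_g = terminal_strings(G, "S", "abc", 9)
strings_g_prime = terminal_strings(G_PRIME, "S", "abc", 9)
```

Up to length 9 both searches yield abc, aabbcc, and aaabbbccc. Agreement on a bounded search is only evidence, not a proof, of language equality.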

Relations  Between  RP  and  DRP  Languages 

It is easy to show that all DRP languages are also RP.

Theorem 5.2

If a grammar, G = (N,T,P,S), is DRP, then it is also RP.

Proof

There is a set ACS(G,k) that satisfies the DRP definition. Let the sets T(Ap) and S(T') be defined as above and let

    T(Ap) = {t_1, ..., t_n}.

If G_A is the k-augmented grammar derived from G then let CS(G_A,Ap) = {wAp | wAp ∈ CS(G_A)}. By definition of S(T'),

    CS(G_A,Ap) = S({t_1}) ∪ ... ∪ S({t_n}).

Each S({t_i}) is regular. Regular sets are closed under union (Hopcroft and Ullman 1969). CS(G_A) is a finite union of the sets CS(G_A,Ap). Thus CS(G_A) is regular. Since

    CS(G_A) = CS(G) ∪ {S ⊥^k A0},

CS(G) is the intersection of CS(G_A) with the complement of {S ⊥^k A0}. Regular sets are closed under complementation and intersection (Hopcroft and Ullman 1969). Thus CS(G) is regular and G is RP. □

By using similar reasoning we can show that if G is DRRP, then G is RRP. Thus we state without proof

Theorem 5.3

If a grammar G is DRRP, then G is also RRP. □


It is well known that not all CFG's are LR(k) for any k. The following corollary should be evident.

Corollary

There are RP grammars that are not DRP and RRP grammars that are not DRRP.

Proof

The grammar G = ({S},{a,b},P,S), where P is

    S -> aSa
    S -> bSb
    S -> aa
    S -> bb

and L(G) = {ww^R | w ∈ {a,b}+}, is RP and RRP. It is not LR(k) for any k (Hopcroft and Ullman 1969). Thus G is not DRP or DRRP. □

Finally,  in  the  language  domain  we  have 

Theorem 5.4

If a language, L, is DRP or DRRP then L is RP or RRP respectively.

Proof

If L is DRP or DRRP, then L = L(G) for some grammar G that is DRP or DRRP respectively. From Theorems 5.2 and 5.3, G is also RP or RRP. By definition L is RP or RRP respectively. □


The Equivalence of DRP Languages and D2SM's with the Error Property

The  class  of  DRP  languages  is  equivalent  to  the 
class  of  languages  recognized  by  a  D2SM  with  the  error 
property.  We  prove  this  by  the  following  series  of 
lemmas  and  theorems. 

Theorem 5.5

Let L be any DRP language. Thus L = L(G) for some DRP grammar G. Assume the set ACS(G,k) satisfies the DRP definition. There exists a D2SM, M, with the error property, that accepts the language L(G){⊥^k}.

Proof

The set ACS(G,k) can be used to construct a D2SM that recognizes L. The construction is straightforward and is analogous to the construction of an LR(k) parser from ACS(G,k) when G is a CFG. We will only give an informal description.

Let G_A be the k-augmented grammar derived from G. CS(G_A) is a regular set (Theorem 5.2). We construct a D2SM, M, with a finite control based on CS(G_A).

For any apply symbol, Ap, define the set

    T(Ap) = {t | (wAp,t) ∈ ACS(G,k)}.

T(Ap) is the set of all look ahead strings that may follow the right side of production p. If T' is any subset of T(Ap), then define

    S(T') = {wAp | (wAp,t) ∈ ACS(G,k) for all t ∈ T'}.

Finally, let CS(G_A,Ap) = {wAp | wAp ∈ CS(G_A)}. As a direct result of the definition of ACS(G,k), there are subsets T_1,...,T_n of T(Ap) such that

(i) CS(G_A,Ap) = S(T_1) ∪ ... ∪ S(T_n);

(ii) S(T_j) is regular for all j;

(iii) S(T_j) ∩ S(T_l) = ∅ if j ≠ l; and

(iv) T(Ap) = T_1 ∪ ... ∪ T_n.

The finite control of M is designed so that, for every set S(T_j), there is a state q_j with an Ap transition so that

    q_1 X_1...X_n |-* (X_1,q_1)...(X_n,q_n)q_j

if and only if X_1...X_n Ap ∈ S(T_j) (X_n = ε if the p-th production is u -> εAp).

Inadequate states are resolved using the look ahead strings from ACS(G,k). By construction and from the DRP definition, M is a D2SM with the error property.

From Theorem 3.2, M accepts the language L(G_A) = L(G){⊥^k}. □

To prove the converse of this theorem, we will construct a DRP grammar from any D2SM with the error property. We will describe the construction and then show that the resultant grammar is DRP.

Let M = (K,V_L,V_R,T,I) be a D2SM with the error property. T is the terminal set of M. The constructed grammar will be

    G_M = (V_L ∪ (K-{q_1}) ∪ (V_R-T) ∪ {A}, T ∪ {q_1}, P, A)

where

    P = {x -> yAp | y -> x ∈ I and y -> x is not the accept instruction} ∪ {A -> q_1 SAp | q_1 S -> ACCEPT ∈ I}.

Note that G_M is already in 0-augmented form.
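
The construction of P amounts to reversing every instruction of M and appending an apply symbol. A minimal Python sketch follows, assuming that each instruction is given as a pair (left side, right side) of tuples of symbol names, that the accept instruction has the right side ("ACCEPT",), and that apply symbols are named by the position of the instruction in the list. All of these encodings are hypothetical choices made for illustration.

```python
def grammar_from_machine(instructions, goal="A"):
    """Build the productions of G_M from a list of instructions of M.

    Each instruction is a pair (lhs, rhs) of tuples of symbol names.
    A non-accept instruction y -> x becomes the production x -> y Ap.
    The accept instruction q1 S -> ACCEPT becomes goal -> q1 S Ap.
    """
    productions = []
    for number, (lhs, rhs) in enumerate(instructions, start=1):
        apply_symbol = f"Ap{number}"  # hypothetical naming of apply symbols
        if rhs == ("ACCEPT",):
            productions.append(((goal,), lhs + (apply_symbol,)))
        else:
            productions.append((rhs, lhs + (apply_symbol,)))
    return productions
```

With the instructions q1 a -> (a,q1)q2, (a,q1)q2 -> q1 S, and q1 S -> ACCEPT, the resulting grammar derives A => q1 S => (a,q1)q2 => q1 a (apply symbols omitted), which mirrors the parse of the input a in reverse.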

Lemma 5.1

A =>* q_1 w in G_M if and only if w is accepted by M.

Proof

Trivially true from the construction of G_M. □

Thus L(G_M) = {q_1}L(M). Clearly G_M can be modified so that it generates exactly L(M) without the leading q_1. Adding the production q_1 -> ε is a simple method of achieving this.

The next series of lemmas proves that G_M is a DRP grammar. In particular, the connection between state strings of M and characteristic strings of G_M is established.


Lemma 5.2

If A =>* uqw, then uq is a state string.

Proof

Since A =>* uqw, uqw |-* q_1 S |- ACCEPT. By definition, for any reduce instruction, and for any state string zq', zyq is also a state string. If uq is not a state string, then at some reduce state in the parse, uqw |-* q_1 S, a reduce instruction must have been encountered that violated this constraint on D2SM's. Thus uq is a state string. □

The  converse  is  also  true. 

Lemma 5.3

If uq is a state string, then A =>* uqw for some w.

Proof

The definition of a D2SM with the error property guarantees that for any state string, uq, there is some string w such that uqw |-* q_1 S |- ACCEPT. Thus A =>* uqw as required. □

Thus uq is a state string if and only if A =>* uqw for some w. We can prove more than this.

Lemma 5.4

(i) Let q'y -> (X,q')xqAl be in P. u(X,q')xq is a state string if and only if u(X,q')xqAl ∈ CS(G_M).

(ii) Let (X,q)q' -> qXAl be in P. uq is a state string if and only if uqXAl ∈ CS(G_M).

(iii) Let (X,q)q't -> qXtAl be in P. uq is a state string if and only if uqXtAl ∈ CS(G_M).

(iv) Let q't -> qtAl be in P. uq is a state string if and only if uqtAl ∈ CS(G_M).

Proof

Observe that all derivations under G_M are canonical.

(i) This is an immediate consequence of Lemmas 5.2 and 5.3.

(ii) If A =>* uqXz, then uq is a state string (Lemma 5.2). Let uq be any state string. Since there is an instruction qX -> (X,q)q', there exists a string z such that

    uqXz |- u(X,q)q'z |-* ACCEPT.

Therefore, A =>* u(X,q)q'z => uqXz and uqXAl is in CS(G_M).

(iii) If A =>* uqXtz, then uq is a state string (Lemma 5.2). Let uq be any state string. I contains the instruction qXt -> (X,q)q't. By definition, there is a string z such that

    uqXtz |- u(X,q)q'tz |-* ACCEPT.

Therefore, A =>* u(X,q)q'tz => uqXtz and uqXtAl ∈ CS(G_M).

(iv) If A =>* uqtz, then uq is a state string (Lemma 5.2). Let uq be any state string. There is a look ahead instruction, qt -> q't, in I. By definition, there is a z such that

    uqtz |- uq'tz |-* ACCEPT.

Thus A =>* uq'tz => uqtz and uqtAl ∈ CS(G_M). □

The following lemma develops the exact relation between the state strings, C', and the characteristic strings of G_M. It will be shown that, since C' is regular, CS(G_M) is also regular and hence G_M is regular parsable.

Lemma 5.5

G_M is RP.

Proof

Recall that all derivations in G_M are canonical.

Consider  any  derivation 


M  M 


In terms of M, q is a state that can be reached from q' by the application of one of M's instructions. This instruction is a read instruction, a reduce instruction, a read and look ahead instruction, or a look ahead instruction. The relationship between characteristic strings and state strings is best discussed in terms of these four cases.

If M is in a state q_i, it will be in one of the states q_i1, ..., q_in after one instruction is used. Thus the state strings ending in q_i (written C'(q_i)) can be expressed in terms of C'(q_i1), ..., C'(q_in). Similarly, we can subdivide CS(G_M) into sets CS(G_M,q_i) which contain strings with a q_i in them. CS(G_M,q_i) can also be expressed in terms of C'(q_ij). If the instructions of M are numbered in some arbitrary order, then we write

    C'(q_i) = A(q_i1,l_1) ∪ ... ∪ A(q_in,l_m)

where A(q_ij,l_k) is defined in terms of C'(q_ij) and the l_k-th instruction. q_ij is reached from q_i by using the l_k-th instruction. Similarly

    CS(G_M,q_i) = B(q_i1,l_1) ∪ ... ∪ B(q_in,l_m)

where B(q_ij,l_k) is defined in terms of A(q_ij,l_k). B(q_ij,l_k) will in fact be all characteristic strings containing a q_i and ending in the apply symbol corresponding to the l_k-th instruction.

For example, assume the following instructions are applicable to state q.

    qX_1 -> (X_1,q)q_2              instruction (1)
    qX_2 t -> (X_2,q)q_3 t                      (2)
    (X_3,q_4)y_1 q -> q_4 x_1                   (3)
    (X_4,q_5)y_2 q -> q_5 x_2                   (4).

Then

    C'(q) = A(q_2,1) ∪ A(q_3,2) ∪ A(q_4,3) ∪ A(q_5,4)

and

    A(q_2,1) = (C'(q_2) / {(X_1,q)q_2}){q}
    A(q_3,2) = (C'(q_3) / {(X_2,q)q_3}){q}
    A(q_4,3) = (C'(q_4) / {q_4}){(X_3,q_4)y_1 q}
    A(q_5,4) = (C'(q_5) / {q_5}){(X_4,q_5)y_2 q}.
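
The right quotient and concatenation used in these expressions are easy to illustrate on finite sets. The Python sketch below is only such an illustration: the sets C'(q) are in general infinite regular sets, and the helper names and the encoding of strings as tuples of symbols are assumptions made here.

```python
def right_quotient(language, suffixes):
    """Return {x | xy is in language for some y in suffixes} for finite sets."""
    result = set()
    for s in language:
        for y in suffixes:
            cut = len(s) - len(y)
            if cut >= 0 and s[cut:] == y:
                result.add(s[:cut])
    return result

def concat(left, right):
    """Return the concatenation {xy | x in left, y in right} of finite sets."""
    return {x + y for x in left for y in right}

# A toy finite sample standing in for C'(q_2); stack symbols are written as pairs.
sample_C_q2 = {
    (("X1", "q"), "q2"),
    (("Y", "q1"), ("X1", "q"), "q2"),
    (("Z", "q1"), "q2"),
}
# A(q_2,1) = (C'(q_2) / {(X_1,q)q_2}){q} on the sample.
A_q2_1 = concat(right_quotient(sample_C_q2, {(("X1", "q"), "q2")}), {("q",)})
```

On this sample the quotient strips the suffix (X_1,q)q_2 from the two strings that end with it and discards the third, so A_q2_1 contains exactly the strings q and (Y,q_1)q.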


The general definitions of A(q_ij,l_k) and B(q_ij,l_k) depend on the four cases discussed above. For convenience, we will assume that the l-th instruction is given the apply symbol Al when it is considered as a production.

The following properties of regular sets (Salomaa 19.., Hopcroft and Ullman 1969) are used implicitly in the rest of this proof:

(i) a finite set is regular;

(ii) regular sets are closed under the operations of concatenation and right quotient (/).

Case 1 - A =>* xq_ij yz => x(X,q_ij)uq_i z.

Here x(X,q_ij)uq_i ∈ C'(q_i) and x(X,q_ij)uq_i Al ∈ CS(G_M). The applied production was q_ij y -> (X,q_ij)uq_i Al, the reverse of a reduce instruction. Thus q_ij is reached, from q_i, by this instruction. Here

    A(q_ij,l) = (C'(q_ij) / {q_ij}){(X,q_ij)uq_i}

and B(q_ij,l) = A(q_ij,l){Al}. Since C'(q_ij) is regular, B(q_ij,l) is regular.

Case 2 - A =>* x(X,q_i)q_ij w => xq_i Xw.

The production used was (X,q_i)q_ij -> q_i XAl, the reverse of a read instruction. xq_i XAl is a characteristic string and xq_i is a state string. q_ij is accessed from q_i by this read instruction. Consequently,

    A(q_ij,l) = (C'(q_ij) / {(X,q_i)q_ij}){q_i}

and B(q_ij,l) = A(q_ij,l){XAl}. Once again A(q_ij,l) and B(q_ij,l) are regular sets.

Case 3 - A =>* x(X,q_i)q_ij uw => xq_i Xuw.

In this case, xq_i XuAl ∈ CS(G_M) and xq_i ∈ C'. The production used was (X,q_i)q_ij u -> q_i XuAl, the reverse of a read and look ahead instruction that accesses q_ij from q_i. Thus,

    A(q_ij,l) = (C'(q_ij) / {(X,q_i)q_ij}){q_i}

and

    B(q_ij,l) = A(q_ij,l){XuAl}.

A(q_ij,l) and B(q_ij,l) are regular sets.

Case 4 - A =>* xq_ij uw => xq_i uw.

Here xq_i uAl ∈ CS(G_M) and xq_i ∈ C'. The production q_ij u -> q_i uAl corresponds to the look ahead instruction q_i u -> q_ij u. Hence,

    A(q_ij,l) = (C'(q_ij) / {q_ij}){q_i}

and

    B(q_ij,l) = A(q_ij,l){uAl}.

A(q_ij,l) and B(q_ij,l) are still regular.

CS(G_M) also contains the special characteristic string q_1 SAp corresponding to the use of A -> q_1 S.

CS(G_M,q_i) is a finite union of the regular sets B(q_ij,l) and CS(G_M) is a finite union of CS(G_M,q_i)'s. Regular sets are closed under finite union (Hopcroft and Ullman 1969) and so CS(G_M) is regular. Thus G_M is regular parsable. □

The following corollary will be useful in Chapter 7 and is an immediate consequence of the previous lemma.

Corollary

Let M be an error property 2SM without look ahead and let G_M be the grammar constructed from M as above. G_M is an RP grammar. □

It remains to be shown that G_M is DRP.


Theorem 5.6

G_M is deterministic regular parsable.

Proof

In fact, CS(G_M) requires no look ahead strings. Note that no string in CS(G_M,q_i) can be a prefix of a string in CS(G_M,q_j), where i ≠ j, since the first string contains a q_i and the second a q_j. Thus, we need only consider two strings in some set CS(G_M,q_i) to check for determinism.

Assume q_i is a reduce state (and so has only reduce instructions). In this case,

    C'(q_i) = A(q_i1,l_1) ∪ ... ∪ A(q_in,l_n)

and

    A(q_ij,l_j) = (C'(q_ij) / {q_ij}){(X,q_ij)yq_i}.

Since M is a D2SM,

    (C'(q_ij) / {q_ij}) ∩ (C'(q_ik) / {q_ik}) = ∅ for all j ≠ k

and so all of A(q_ij,l_j) are disjoint. Now, since strings in CS(G_M,q_i) are merely strings in one of A(q_ij,l_j) with an apply symbol appended to them, there cannot be two strings in CS(G_M,q_i) such that one is a prefix of the other.

If q_i is not a reduce state, it must be a read state with only look ahead, read, read-and-look-ahead, or accept instructions applicable. In this case, A(q_ij,l) = A(q_ik,l') for all j, k, l, and l'. Let us call the common value Z. All strings in Z end in q_i and so none can be a proper prefix of another. Now, each set, B(q_ij,l), can be written

    B(q_ij,l) = Z{XAl}     if l is a read instruction
              = Z{YuAl}    if l is a read and look ahead instruction
              = Z{uAl}     if l is a look ahead instruction.

Since M is deterministic, none of X, Yu, or u can be prefixes of each other. For state q_1, the special characteristic string q_1 SAp corresponding to the production A -> q_1 SAp is included. S cannot be a prefix of any of X, Yu, or u since M is a D2SM.

For the 0-augmented grammar constructed from G_M, the set

    {(wAp,ε) | wAp ∈ CS(G_M)} ∪ {(AA0,ε)}

is a set of 0-augmented characteristic strings and satisfies the DRP definition. Thus G_M is DRP. □

Since G_M is DRP, L(G_M) is a DRP language. We can also show that L(M) is DRP.

Theorem 5.7

Let M = (K,V_L,V_R,T,I) be any D2SM with the error property. The language accepted by M, L(M), is DRP.

Proof

We construct the DRP grammar G_M as in the proof of the last theorem. If G_M = (V_L ∪ (K-{q_1}) ∪ (V_R-T) ∪ {A}, T ∪ {q_1}, P, A) then construct the grammar

    G_M' = (V_L ∪ K ∪ (V_R-T) ∪ {A}, T, P', A).

If

    F = {X | X ∈ V_R and M has an instruction of the form q_1 Xt -> ...}

then P' is defined by

    P' = P ∪ {q_1 X -> XAp_X | X ∈ F}.

Thus CS(G_M') = CS(G_M) ∪ {XAp_X | X ∈ F}. Regular sets are closed under union (Hopcroft and Ullman 1969) and so G_M' is RP.

All strings in CS(G_M) begin with q_1 or a symbol in V_L. Thus, we can easily show that G_M' is DRP. In fact, the set

    {(wAp,ε) | wAp ∈ CS(G_M')} ∪ {(AA0,ε)}

is a set of 0-augmented characteristic strings for G_M' and satisfies the DRP definition. Since X ∈ F if and only if some string beginning in X is accepted by M, L(G_M) = {q_1}L(G_M'). Therefore, L(M) = L(G_M') is DRP. □

We have shown the equivalence of the DRP concept and D2SM's with the error property. Thus we can explore the class of DRP languages by either approach. A similar result holds for DRRP languages.

DRRP Languages and D2SM's

Consider a D2SM, M = (K,V_L,V_R,T,I), with the following characteristics.

(i) Let (X,q')xq -> q'y be any reduce instruction. For any state string, u(X,q')xq, there are strings, z ∈ V_R* and w ∈ T*, such that

    q_1 w |-* u(X,q')xqz |- uq'yz |-* ACCEPT.

(ii) Let qX -> (X,q)q' be any read instruction. For any state string, uq, there are strings, z ∈ V_R* and w ∈ T*, such that

    q_1 w |-* uqXz |- u(X,q)q'z |-* ACCEPT.

(iii) Let qXt -> (X,q)q't be any read-and-look-ahead instruction. For any state string, uq, there are strings, z ∈ V_R* and w ∈ T*, such that

    q_1 w |-* uqXtz |- u(X,q)q'tz |-* ACCEPT.

(iv) Let qt -> q't be any look ahead instruction. For any state string, uq, there are strings, z ∈ V_R* and w ∈ T*, such that

    q_1 w |-* uqtz |- uq'tz |-* ACCEPT.

(v) Let M' be the no look ahead version of M. Then L(M) = L(M'). As for D2SM's with the error property, this restriction means that look ahead is not used to affect the language accepted by M.

These constraints imply that, for any state string, uq, there are strings, z ∈ V_R* and w ∈ T*, such that

    q_1 w |-* uqz |- u'q'z' |-* ACCEPT.

M is called a D2SM with the restricted error property. D2SM's with the restricted error property correspond to the DRRP concept. We will only sketch the proofs of this connection since they are essentially the same as those for the DRP case.


Theorem 5.8

Let L be any DRRP language. Thus L = L(G) for some
DRRP grammar G. Assume that the set ARCS(G,k) satisfies the
DRRP definition. There exists a D2SM, M, with the restricted
error property that accepts the language $L(G)\{\perp^k\}$.

Proof

The required D2SM is constructed from the set ARCS(G,k)
using the construction outlined in Theorem 5.5. The D2SM
will meet the necessary requirements by definition of ARCS(G,k)
and the DRRP property. □
Theorem 5.9

If L is the language accepted by a D2SM, M, with the
restricted error property, then L is DRRP.

Proof

We construct $G_M$ as before. Because of the added
constraints on M, for any state string, $uq$, there is a string,
w, such that

$A \Rightarrow^* uqw \Rightarrow^* z$

where z is in $T^*$ (T is the terminal set for M). Thus
$CS(G_M) = RCS(G_M)$ and $G_M$ is DRRP.

$G_M$ generates the language $\{q_1\}L$. If
$G_M = (N, T, P, A)$, then construct a new grammar
$G_M' = (N', T, P', A)$. $P'$ is defined by

$P' = P \cup \{q_1 \to e\,\Delta p\}$.

If $F = \{X \mid X \in T$ and there is a string $Xw$ in $L(M)\}$, then define

$ARCS(G_M',1) = \{(w\Delta r, e) \mid w\Delta r \in RCS(G_M)\} \cup \{(\Delta p, X) \mid X \in F\}$.

Arguing as we did in Theorem 5.7, we will see that $ARCS(G_M',1)$
satisfies the DRRP definition and $G_M'$ is DRRP. Since
$L(G_M') = L(M)$, $L(M)$ is a DRRP language. □
The distinction between the DRP and DRRP properties
should be clarified by our discussion. Let $M = (K, V_L, V_R, T, I)$
be any D2SM. If M has the restricted error property, then
every instruction in I is used in the parse of at least one
string in $T^*$ that is accepted by M. If M has the error
property, then every instruction in I is used in the parse
of at least one string accepted by M. However, there may be
instructions that are not used in the parse of any accepted
string in $T^*$.

Finally, we may note that, since the D2SM's constructed
in  Theorems  5.5  and  5.8  satisfy  the  requirements  of  Theorem 
the  following  theorem  follows  immediately. 

Theorem 5.9

(i) Let M be any D2SM with the (restricted) error
property. M is a canonical parser.

(ii) Let G be any DRP or DRRP grammar. G is unambiguous.

Undecidability of the DRP and DRRP Properties

Parsers for DRP and DRRP grammars have many desirable
features. Unfortunately, there can be no algorithm that
will determine if an arbitrary TOG is DRP or DRRP. This
is proven below.




Theorem 5.10

It is undecidable whether an arbitrary TOG, G, is
DRP or DRRP.

Proof 

If  G  is  a  CFG,  the  question  is  equivalent  to  deciding 
if  there  exists  a  k  such  that  G  is  LR(k)  .  This  question 
is  undecidable  (Hopcroft  and  Ullman  1969)  .  □ 

Some  other  interesting  properties  of  RP  languages 
and  their  subsets  will  be  discussed  in  Chapter  7.  In  the 
next  chapter,  we  present  a  parser  generator  that  will 
construct  a  deterministic  parser  for  some  DRP  grammars. 
CHAPTER 6


A PARSER GENERATOR FOR DRP GRAMMARS


In this chapter, we describe and examine a parser
generator which will build a D2SM with the error property for
some TOG's. The algorithm will be very similar to Knuth's
LR(k) algorithm. Our notation and description will deliberately
parallel the terminology used in our review of Knuth's
algorithm.

Description  of  the  Parser  Generator 


Let m be any non-negative integer and let
$G_A = (N_A, T_A, P_A, S_A)$ be the m-augmented grammar derived from the
TOG, G = (N,T,P,S). The algorithm will create the finite state
control based on $CS(G_A)$.

For each state q in the 2SM a unique state set, Q(q),
is constructed. Q(q) is made up of items, (p, j, w, T(w)), where
p is the number of one of the productions (specified by its
apply symbol), j is the location (indexed from zero) of one of
the symbols on the right side of production p, w is a right
context string (a string over $N_A \cup T_A$ of length at most m),
and T(w) is a tag function associated
with w. There will often be many items in a state set with the
same p and j but different right context strings. Occasionally,
it is more convenient to represent an item by
$(x \to y_1 \wedge Y y_2 \Delta p, w, T(w))$
where the $\wedge$ precedes the j-th symbol in $y_1 Y y_2 \Delta p$. The existence of
such an item in a state set implies that, if the finite control
were in state q, it could read a Y and this Y might be part of
an occurrence of the right side of production p. If this is the
case, then the item tells us that having read $Y y_2$ we would
expect to see w at the top of the R-STACK. T(w) provides
information about the nature of w.

Right context strings are the analogue of Knuth's look
ahead strings. These strings are used in a new way in this
algorithm.

From our description of w it should be clear that if w
can appear at the top of the R-STACK, then any string z
($wy \Rightarrow^* z$ for some y) might also appear there. To guarantee that
our parser generator will terminate, we limit the size of right
context strings to a maximum length of m symbols. Longer right
context strings provide more information about what symbols may
follow a production. At the same time, a larger value of m can
lead to many more possible right context strings and hence more
overhead in generating the parser. Thus, we want to use as small
a value of m as possible and still maintain the necessary
information in the right context strings.

To compute right context strings, we will define a
function, $H_m$, which may be applied to any string v in $(N_A \cup T_A)^*$.
The value of the tag function, T(w), is a string of binary
symbols (0 or 1) in one to one correspondence with symbols in w
and will give information about how w was derived using $H_m$.
Because of the correspondence between w and T(w), T(xy) = T(x)T(y).

When $H_m$ is applied to a string, v, which has an
associated tag T(v), $H_m(v)$ is the least set such that

(i) $\mathrm{FIRST}(v,m) \in H_m(v)$
and $T(\mathrm{FIRST}(v,m)) = \mathrm{FIRST}(T(v),m)$.

(ii) If $xyz \in H_m(v)$ and if there is a production
$y \to u \in P_A$, then
$\mathrm{FIRST}(xuz,m) \in H_m(v)$ and
$T(\mathrm{FIRST}(xuz,m)) = \mathrm{FIRST}(T(x)T(u)T(z),m)$ where
$T(u) = 1^{|u|}$ if there was a 1 in T(y), and
$T(u) = 0^{|u|}$ otherwise.

(iii) If $xy \in H_m(v)$, $y \in \mathrm{PREFIX}(z)$, $y \ne z$, and $z \to u \in P_A$,
then
$\mathrm{FIRST}(xu,m) \in H_m(v)$ and
$T(\mathrm{FIRST}(xu,m)) = \mathrm{FIRST}(T(x)1^{|u|},m)$.

The implications of the tag function should now be
apparent. If only a proper prefix of the left side of a
production is in a string in $H_m(v)$ (as described in (iii)), then
symbols in the tag corresponding to symbols of the right side
of the production are 1. If, as in (ii), members are
generated from these symbols, then this prefix information
is preserved and propagated. Furthermore, since 1's are
introduced by case (iii) only, tags will always be of the form
$0^k 1^j$.
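The three rules can be read as a worklist closure over (string, tag) pairs. The following Python sketch is one such reading and is not taken from the thesis. It assumes that every grammar symbol is a single character, that productions are given as (left side, right side) pairs without their apply symbols, and that the prefix y in rule (iii) is non-empty. The names `h_m`, `first`, and `most_certain` are illustrative.

```python
from typing import Dict, List, Set, Tuple

Production = Tuple[str, str]   # (left side, right side); one character per symbol
Tagged = Tuple[str, str]       # (string, tag); the tag has one '0' or '1' per symbol


def first(s: str, m: int) -> str:
    """FIRST(s, m): the first m symbols of s (all of s if it is shorter)."""
    return s[:m]


def h_m(v: str, tag: str, productions: List[Production], m: int) -> Set[Tagged]:
    """Least set of tagged strings closed under rules (i)-(iii)."""
    start = (first(v, m), first(tag, m))          # rule (i)
    result: Set[Tagged] = {start}
    work = [start]
    while work:
        s, t = work.pop()
        found = []
        for lhs, rhs in productions:
            # Rule (ii): the whole left side occurs inside s.
            pos = s.find(lhs)
            while pos != -1:
                end = pos + len(lhs)
                bit = "1" if "1" in t[pos:end] else "0"
                found.append((first(s[:pos] + rhs + s[end:], m),
                              first(t[:pos] + bit * len(rhs) + t[end:], m)))
                pos = s.find(lhs, pos + 1)
            # Rule (iii): s ends in a non-empty proper prefix of the left side
            # (non-emptiness is an assumption of this sketch).
            for k in range(1, len(lhs)):
                if s.endswith(lhs[:k]):
                    cut = len(s) - k
                    found.append((first(s[:cut] + rhs, m),
                                  first(t[:cut] + "1" * len(rhs), m)))
        for pair in found:
            if pair not in result:
                result.add(pair)
                work.append(pair)
    return result


def most_certain(pairs: Set[Tagged]) -> Dict[str, str]:
    """For each string keep the tag with the longest run of leading zeros."""
    best: Dict[str, str] = {}
    for s, t in pairs:
        zeros = len(t) - len(t.lstrip("0"))
        if s not in best or zeros > len(best[s]) - len(best[s].lstrip("0")):
            best[s] = t
    return best
```

The closure terminates because only finitely many strings of length at most m, each with finitely many tags, exist. Applied to the example grammar at the end of this chapter with m = 2, writing `$` for the end marker, `h_m("S$", "00", ...)` produces fifteen distinct strings whose best tag is 00, plus a few uncertain duplicates such as ("Ab", "01") that come from rule (iii).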

As an immediate result of this discussion, we can
define a partial ordering between items. We shall say that
(p,j,w,T(w)) > (q,i,w',T(w')) if and only if p = q, i = j,
w = w', $T(w) = 0^k 1^{|w|-k}$, and $T(w') = 0^s 1^{|w|-s}$ where k > s. Thus
(p,j,w,T(w)) > (q,i,w',T(w')) if and only if w and w' are
the same string, p = q, i = j, and we are certain of more
symbols in w than in w'. The relations <, ≤, and ≥ are defined
in the obvious way. These relations are useful because, if
(p,j,w,T(w)) > (p,j,w',T(w')) and they are both in a state
set, we can delete the second item since the first contains
all the information that is in the second.
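The ordering and the deletion it justifies can be sketched in a few lines of Python. This is only an illustration: items are modelled as plain tuples (p, j, w, T(w)) with tags written as strings of '0' and '1', and the names `certainty`, `dominates`, and `prune` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Item = Tuple[int, int, str, str]   # (p, j, w, T(w))


def certainty(tag: str) -> int:
    """Number of leading zeros in a tag of the form 0^k 1^j."""
    return len(tag) - len(tag.lstrip("0"))


def dominates(a: Item, b: Item) -> bool:
    """True when a > b: same p, j and w, and more symbols of w are certain in a."""
    return a[:3] == b[:3] and certainty(a[3]) > certainty(b[3])


def prune(items: Iterable[Item]) -> Set[Item]:
    """Drop every item that is dominated by another item of the set."""
    pool = set(items)
    return {b for b in pool if not any(dominates(a, b) for a in pool)}
```

For example, (6, 0, "ab", "00") dominates (6, 0, "ab", "01"), so `prune` keeps only the first of the two, while an item with a different right context string is left untouched.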

The function $H_m$ is clearly very similar to the function $H_k$
used in the LR(k) algorithm. Informally, in a context
free grammar, if we know the first k symbols of a string u
then we can compute, using $H_k$, the first k symbols of any
string derived from u. Knowing only the first k symbols of
u is sufficient to do this. In a general type-0 grammar,
the first k symbols of a string derived from another string,
uz, often can depend on more than the first k symbols of u.
Since we only know these k symbols, in the general case, $H_k$
must include some strings that depend on unknown symbols. The
tag function reflects this lack of knowledge. For a context
free grammar, the tag will always be $0^k$.

State sets are created by starting with some initial
set of items and computing the closure of this set. The
inclusion of new items is determined by two closure functions,
C and C'. By definition, $C((x \to y \wedge \Delta p, w, T(w))) = \emptyset$ and

$C((x \to y_1 \wedge Y y_2 \Delta p, w, T(w))) = \{(Yv \to \wedge u \Delta r, w', T(w')) \mid Y \in V_A$ (and if v = e, then $Y \in N_A$), $vw' \in H_m(y_2 w)$ with $T(v) = 0^{|v|}\}$.

This definition includes the important special case v = e,
when $T(e) = 0^{|e|} = e$. Informally, if $(x \to y_1 \wedge Y y_2 \Delta p, w, T(w))$ is
in the state set and if we are certain that what follows Y
(i.e. $y_2 w$) can derive a string beginning with v, then the
production $Yv \to u \Delta r$ could be applied. Consequently, we include
the item $(Yv \to \wedge u \Delta r, w', T(w'))$ in the state set.

The  closure  function  we  have  defined  is  also  similar 
to  that  used  in  Knuth’s  LR(k)  algorithm.  The  main  difference 
is  that  in  our  case  we  must  deal  with  productions  that  are  not 
context  free.  The  right  context  strings  are  used  to  prevent 
certain  items  from  being  added  to  the  state  set  Q(q). 

Clearly there are cases when C does not add an item
to the state set because we are uncertain whether it should
be included. This occurs if T(v), in the definition of C,
contained a 1. Note that in these circumstances $T(w') = 1^{|w'|}$.
Such items may be added to the state set later by completing
some other item in Q(q). More generally, if we are uncertain
about some item, (p,j,w',T(w')), and there is another item
in the state set, (p,j,w,T(w)), then we can add (p,j,w',T(w'))
to the state set even if $w \ne w'$. Informally, since
$T(w') = 1^{|w'|}$, the inclusion of this item cannot cause the
addition of some new item containing a production that is not
context free. This is why its inclusion is allowed.

If there are uncertain items that are not eventually
included in the state set, then the algorithm will terminate.
In this case, it has been unable to determine whether G is
regular parsable.

The conditional closure function, C', identifies
the uncertain items discussed above. It is defined as follows.

$C'((x \to y_1 \wedge Y y_2 \Delta p, w, T(w))) = \{(r, 0, w', T(w')) \mid$
production r is $Yv \to u \Delta r$, and either $vw' \in H_m(y_2 w)$
and T(v) contains a 1 or $v' \in H_m(y_2 w)$,
$v' \in \mathrm{PREFIX}(v)$, $v' \ne v$, and $w' = e\}$.

The actual algorithm for computing a state set
is straightforward with these definitions and is given
below.

The  Algorithm  for  State  Set  Closure 

The description of the algorithm is given in an
ALGOL-like notation. Structure is indicated by textual
indentations.

current_state_set = initial_state_set
conditional_items_set = ∅
old_set = ∅

while old_set ≠ current_state_set do
    old_set = current_state_set
    for each (p,j,w,T(w)) ∈ old_set do
        conditional_items_set = conditional_items_set ∪ C'((p,j,w,T(w)))

        for each (q,i,v,T(v)) ∈ C((p,j,w,T(w))) do
            if (q,i,v,T(v)) > (r,k,u,T(u)) for some
                    (r,k,u,T(u)) ∈ current_state_set
            then replace (r,k,u,T(u)) by (q,i,v,T(v))
            else if (q,i,v,T(v)) ≰ (r,k,u,T(u)) for any
                    (r,k,u,T(u)) ∈ current_state_set
            then current_state_set = current_state_set ∪ {(q,i,v,T(v))}
        end for

        for each (p,j,w,T(w)) ∈ conditional_items_set do
            if (p,j,w,T(w)) ≤ (p,j,w',T(w')) for some
                    (p,j,w',T(w')) ∈ current_state_set
            then delete (p,j,w,T(w)) from conditional_items_set
            else if there is an item
                    (p,j,w',T(w')) ∈ current_state_set
            then current_state_set = current_state_set ∪ {(p,j,w,T(w))}
        end for
    end for
end while

if conditional_items_set ≠ ∅
then terminate the generation of the 2SM.
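The closure loop above can be sketched in Python as follows. This is a hedged illustration rather than the author's implementation: the closure functions C and C' are passed in as callables, items are plain tuples (p, j, w, T(w)), and the function returns the leftover conditional items instead of aborting, so that the caller decides whether generation must stop. The names `close_state_set`, `_zeros`, and `_leq` are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

Item = Tuple[int, int, str, str]            # (p, j, w, T(w))
Closure = Callable[[Item], Set[Item]]


def _zeros(tag: str) -> int:
    """Number of leading zeros of a tag of the form 0^k 1^j."""
    return len(tag) - len(tag.lstrip("0"))


def _leq(a: Item, b: Item) -> bool:
    """a <= b in the item ordering (same p, j, w; b at least as certain)."""
    return a[:3] == b[:3] and _zeros(a[3]) <= _zeros(b[3])


def close_state_set(initial: Iterable[Item], C: Closure, C_cond: Closure):
    """Return (state_set, leftover_conditional_items)."""
    current: Set[Item] = set(initial)
    conditional: Set[Item] = set()
    old: Set[Item] = set()
    while old != current:
        old = set(current)
        for item in old:
            conditional |= C_cond(item)
            for new in C(item):
                # Items strictly below the new one are replaced by it.
                weaker = {it for it in current if _leq(it, new) and it != new}
                if weaker:
                    current -= weaker
                    current.add(new)
                elif not any(_leq(new, it) for it in current):
                    current.add(new)
            for cand in list(conditional):
                if any(_leq(cand, it) for it in current):
                    conditional.discard(cand)       # already covered
                elif any(it[:2] == cand[:2] for it in current):
                    current.add(cand)               # same p and j is present
    return current, conditional
```

A non-empty second component corresponds to the final test of the algorithm: some uncertain item was never supported, so generation of the 2SM would be abandoned.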


Construction of a 2SM from the State Set


Once a state set, Q(q), has been computed, we can
construct the transitions from q. This process closely
parallels the LR(k) algorithm. For any $Y \in (N_A \cup T_A)$, let us
define $\tau(Y)$ to be the set

$\{(u \to y_1 \wedge Y y_2 \Delta p, w, T(w)) \mid (u \to y_1 \wedge Y y_2 \Delta p, w, T(w)) \in Q(q)\}$.

If $\tau(Y)$ is not empty then there is a read transition, Y, to
state q'. The initial state set for q' is

$\{(u \to y_1 Y \wedge y_2 \Delta p, w, T(w)) \mid (u \to y_1 \wedge Y y_2 \Delta p, w, T(w)) \in \tau(Y)\}$.

Similarly, for every apply symbol $\Delta p$ we define

$\tau(\Delta p) = \{(u \to y \wedge \Delta p, w, T(w)) \mid (u \to y \wedge \Delta p, w, T(w)) \in Q(q)\}$.

If $\tau(\Delta p)$ is not empty, then q has a reduce transition, $\Delta p$, to
a special state $q_0$. $q_0$ has no transitions.

As in the LR(k) case, productions of the form $u \to e \Delta p$
receive special treatment. In this case, the $\Delta p$ transition is
replaced by a read transition, e, to a new state that has a
single reduce transition, $\Delta p$, to $q_0$. q will be an inadequate
state because of this new transition.
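The grouping of items by the symbol that follows the marker can be sketched as below. The representation is an assumption made for illustration: each item carries its production explicitly as (left side, right side, production number, marker position, w, T(w)), with one character per grammar symbol, and the special treatment of empty right sides described above is not modelled. The name `transitions` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (left side, right side, production number, marker position, w, T(w))
DottedItem = Tuple[str, str, int, int, str, str]


def transitions(state_set: Iterable[DottedItem]):
    """Split Q(q) into read targets and reduce sets.

    Returns (reads, reduces): reads[Y] is the initial state set of the state
    reached by reading Y; reduces[p] holds the items whose marker sits just
    before the apply symbol of production p.
    """
    reads: Dict[str, Set[DottedItem]] = defaultdict(set)
    reduces: Dict[int, Set[DottedItem]] = defaultdict(set)
    for lhs, rhs, p, dot, w, tag in state_set:
        if dot < len(rhs):
            # Marker precedes a grammar symbol: advance it over that symbol.
            reads[rhs[dot]].add((lhs, rhs, p, dot + 1, w, tag))
        else:
            # Marker precedes the apply symbol: production p can be reduced.
            reduces[p].add((lhs, rhs, p, dot, w, tag))
    return dict(reads), dict(reduces)
```

A state is inadequate in this representation when `reduces` is non-empty and there is any other entry in `reads` or `reduces`.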
The algorithm continues by finding the closure of
any new initial state sets in the same way. To start, the
initial state set of the start state, $q_1$, is
$\{(S_A \to \wedge S \perp^m \Delta 0, e, e)\}$.

The parser generator terminates when no initial state
sets remain to be closed or when the state set closure process
calls for the parser generation to end. For notational
convenience, we will refer to the 2SM constructed by this
algorithm as M(G).

Since  the  parser  generator  uses  a  fixed  value,  m, 
for  the  length  of  right  context  strings  and  the  number  of 
productions  is  finite,  the  number  of  possible  items  is  also 
finite.  Consequently,  the  number  of  possible  state  sets  is 
finite  and  the  parser  generator  must  terminate. 
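The finiteness argument can be made concrete with a rough count. The bound below is a sketch under two assumptions drawn from the earlier description: tags always have the form $0^k 1^j$, so a right context string of length i admits at most i + 1 tags, and the marker of an item for production p can occupy one of $|\mathrm{rhs}(p)| + 1$ positions, where $\mathrm{rhs}(p)$ denotes the right side without its apply symbol.

```latex
\[
  |\text{items}| \;\le\;
  \Big(\sum_{p \in P_A} \big(|\mathrm{rhs}(p)| + 1\big)\Big)
  \cdot \sum_{i=0}^{m} (i+1)\,|N_A \cup T_A|^{\,i},
  \qquad
  |\text{state sets}| \;\le\; 2^{|\text{items}|}.
\]
```

Both quantities are finite for fixed m, which is all the termination argument needs; the bound is far from tight in practice.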

Look  Ahead 

The  2SM  that  has  been  created  may  contain  inadequate 
states.  In  this  section,  we  propose  several  look  ahead 
algorithms . 

The  DRP  Algorithm 


If m > 0, then the right context strings can be used
as look ahead strings just as they are in the LR(k) algorithm.
For any item (p,j,w,T(w)), let L(w) be the longest prefix of
w such that $T(L(w)) = 0^{|L(w)|}$. This notation will also be
useful in later theorems.
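As a small illustration, L(w) can be computed directly from the tag. The sketch assumes one character per symbol and tags written as strings of '0' and '1'; the name `certain_prefix` is hypothetical.

```python
def certain_prefix(w: str, tag: str) -> str:
    """L(w): the longest prefix of w whose tag consists only of zeros."""
    if len(w) != len(tag):
        raise ValueError("tag must have one symbol per symbol of w")
    zeros = len(tag) - len(tag.lstrip("0"))
    return w[:zeros]
```

For a right context string "ab" tagged "01", only "a" is certain, so `certain_prefix("ab", "01")` returns "a".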

Let q be an inadequate state with state set Q(q).
Define the sets $\tau(Y)$ and $\tau(\Delta p)$ as above. We will recreate
the transitions for state q using the right context strings.
For any symbol t (t can be in $N_A \cup T_A$ or t can be an apply
symbol) and any state q, we define $\Psi_0(q,t) = \{e\}$. For
k > 0, if $\tau(t)$ is empty, then $\Psi_k(q,t) = \emptyset$. If k > 0 and
$t = \Delta p$, then

$\Psi_k(q,\Delta p) = \{w' \mid w' = \mathrm{FIRST}(L(w),k)$ and
$(u \to y \wedge \Delta p, w, T(w)) \in \tau(\Delta p)\}$.

If k > 0 and $t = Y \in (N_A \cup T_A)$ then

$\Psi_k(q,Y) = \{Yw' \mid w' \in \Psi_{k-1}(q'',t')$, M(G) has a read
transition, Y, to q'', and $t' \in (N_A \cup T_A)$
or t' is any apply symbol$\}$.

If $\tau(Y)$ is not empty, then q has a read-and-look-ahead
transition to state q' for every element in the set $\Psi_m(q,Y)$.
q' is the same state that the original read transition, Y,
accessed. If there are two elements of this set, say $Yv_1$ and
$Yv_2$, and $Yv_1 \in \mathrm{PREFIX}(Yv_2)$, then remove the $Yv_2$ transition (it
is, in effect, "covered" by $Yv_1$).
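The removal of covered transitions amounts to keeping only those look ahead strings that have no other string of the set as a proper prefix. A minimal Python sketch, assuming look ahead strings are ordinary strings with one character per symbol and using the hypothetical name `drop_covered`:

```python
from typing import Iterable, Set


def drop_covered(lookaheads: Iterable[str]) -> Set[str]:
    """Keep a string only if no other string in the set is a prefix of it."""
    strings = set(lookaheads)
    return {s for s in strings
            if not any(t != s and s.startswith(t) for t in strings)}
```

Note that an empty look ahead string, if present, covers every other string, which reflects the fact that a transition with no look ahead applies regardless of what follows.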

Similarly, if $\tau(\Delta p)$ is not empty, then q will have
look ahead transitions for every element in $\Psi_m(q,\Delta p)$ to a new
state with a single reduce transition $\Delta p$. If the p-th production
has an empty right side ($u \to e \Delta p$), then the transitions are
read-and-look-ahead transitions for elements of the set

$\{ev \mid v \in \Psi_m(q,\Delta p)\}$.

Once again, if two transitions, $v_1$ and $v_2$ (or $ev_1$ and $ev_2$), are
created and $v_1 \in \mathrm{PREFIX}(v_2)$, then discard the $v_2$ (or $ev_2$)
transition.

We  shall  see  that  if  the  resulting  2SM  is  deterministic, 
then  it  also  has  the  error  property.  Some  inadequate  states 
may  not  be  resolved  because 

(i)  the  grammar  is  not  DRP  or  longer  look  ahead  strings 
are  required.  Increasing  the  value  of  m  may  help 
in  the  second  case; 

(ii)  it  is  a  characteristic  of  the  closure  function  that 
not  all  right  context  strings  will  be  m  symbols 
long.  In  fact,  there  are  cases  where  empty  right 
context  strings  are  created,  independent  of  the 
value  of  m. 

Clearly,  there  will  be  instances  when  inadequate  states 
are  resolved  and  the  full  length  of  the  look  ahead  strings  is 
not  required.  In  these  cases,  we  can  optimize  the  performance 
of  the  2SM  by  using  prefixes  of  the  look  ahead  strings  we 
discussed  above.  In  practice,  we  would  try  to  resolve  an 
inadequacy  by  using  look  ahead  strings  composed  of  a  single 
symbol  only.  If  this  fails  to  resolve  the  inadequacy,  then  we 
would  compute  longer  look  ahead  strings. 

For  context  free  grammars,  our  parser  generator  together 
with  the  DRP  look  ahead  algorithm  is  equivalent  to  the  LR(k) 
algorithm. In fact, LR(k) grammars form a proper subset of
the class of grammars handled by the parser generator.

If the DRP algorithm fails, we can use the entire
right context string, ignoring the value of the tag function.

The Right Context Algorithm

This algorithm is essentially the same as the DRP
algorithm. However, in this case, the entire right context
string may be used instead of the prefix tagged with zeros.

The problem with this algorithm is that some look
ahead strings may be included that would eventually be found
to be erroneous during an actual parse. These extraneous look
aheads may prevent the algorithm from resolving an inadequate
state.

If none of these techniques is successful, the
general look ahead algorithms of Chapter 3 can be applied.

The Effect of m on the Parser Generator

It is clear that the value of m used in the function
$H_m$ has two effects.

(i) It may affect the ability of the algorithm to
construct a 2SM.

(ii) Even if a 2SM is constructed, the ability to
resolve inadequate states may be affected.

The value of m must be chosen so that a 2SM can be constructed
and so that all inadequate states can be resolved. In the
LR(k) algorithm, the value of k can only affect our ability
to resolve inadequate states.

Failure  to  construct  a  2SM  occurs  because  an  item 
remains  in  the  conditional  items  set  after  a  state  set  is 
closed.  We  can  infer  from  this  failure  that  either  G  is  not 
RP  (that  is,  CS(G)  is  not  a  regular  set)  or  G  is  RP  but  some 
right  context  strings  do  not  contain  sufficient  information 
for  the  parser  generator  to  continue.  There  are  RP 
grammars for which the parser generator will fail for all
$m \ge 0$. The DRP grammar $G = (\{S, A, B, S_A\}, \{\perp, a, b, c\}, P, S_A)$
where P contains

$S \to ABSc$
$S \to Abc$
$BA \to AB$
$Bb \to bb$
$Ab \to ab$
$Aa \to aa$

is  an  example.  Thus  our  algorithm  will  not  construct  a 
D2SM  for  all  DRP  grammars.  We  shall  show,  in  Chapter  7,  that 
in  fact  a  general  algorithm  does  not  exist. 
Relation to D2SM's with the Error Property


This section studies the relationship between the
parser generator and D2SM's with the error property. First,
we prove that if the algorithm succeeds in constructing M(G)
(the 2SM before look ahead is added), then G is RP.

Let G = (N,T,P,S) be any TOG and let $G_A = (N_A, T_A, P_A, S_A)$
be the m-augmented grammar derived from G. Assume that the
parser generator algorithm succeeds in constructing the 2SM
M(G). An item $(u \to y_1 \wedge y_2 \Delta p, w, T(w))$ in a state set Q(q) is
essential if and only if, for all state strings
$(V_1,q_1) \ldots (V_n,q_n)P(y_1)q$,

$S_A \Rightarrow^* V_1 \ldots V_n\, u\, L(w)x \Rightarrow V_1 \ldots V_n\, y_1 y_2\, L(w)x$

for some string $x \in (N_A \cup T_A)^*$.

The definition of essentiality is, in effect, a
restatement of the error property. The lemmas that follow will
show that all items in a state set are essential.

Lemma 6.1


If all items in the initial state set of a state q
are essential, then all items in Q(q) are essential.

Proof


Assume $(x \to Y_1 \ldots Y_m \wedge Y y_2 \Delta p', w, T(w))$ is in Q(q) and is
essential. Thus, for any state string
$vq = v'(Y_1,q_1') \ldots (Y_m,q_m')q$ in M(G) with $v' = (V_1,q_1) \ldots (V_n,q_n)$,
there is a string t in $(N_A \cup T_A)^*$ such that

$S_A \Rightarrow^* V_1 \ldots V_n\, x L(w)t \Rightarrow V_1 \ldots V_n\, Y_1 \ldots Y_m Y y_2 L(w)t$.

Let us denote the string $Y_1 \ldots Y_m$ by $y_1$.

The above item can add new items to Q(q) in only two
ways.

Addition (i)

The new item $(Yu \to \wedge z \Delta p, w', T(w'))$ is added because it
is in $C((x \to Y_1 \ldots Y_m \wedge Y y_2 \Delta p', w, T(w)))$. From C's definition, we
know that

(a) $uw' \in H_m(y_2 w)$ and

(b) $T(u) = 0^{|u|}$.

Thus $uL(w') \in H_m(y_2 L(w))$ and so, from $H_m$'s definition,
$y_2 L(w) \Rightarrow^* uL(w')s$ for some $s \in (N_A \cup T_A)^*$, where the last step
in the derivation involved some symbols in u.

There are two cases to consider.

(a) $y_2 \ne e$.

In this case, we immediately have

$S_A \Rightarrow^* V_1 \ldots V_n\, xL(w)t \Rightarrow V_1 \ldots V_n\, y_1 Y y_2 L(w)t$
$\Rightarrow^* V_1 \ldots V_n\, y_1 Y u L(w')st \Rightarrow V_1 \ldots V_n\, y_1 z L(w')st$.

Thus $(Yu \to \wedge z \Delta p, w', T(w'))$ is essential.

(b) $y_2 = e$.

The derivation $S_A \Rightarrow^* V_1 \ldots V_n\, y_1 Y y_2 L(w)t$
can be written

$S_A \Rightarrow^* v''L(w)t \Rightarrow^* V_1 \ldots V_n\, xL(w)t \Rightarrow V_1 \ldots V_n\, y_1 Y y_2 L(w)t$

where the last step in deriving $v''L(w)t$ produced
some of the symbols in $L(w)t$ and subsequent steps
involve no symbols in $L(w)t$. Thus,

$S_A \Rightarrow^* v''L(w)t \Rightarrow^* v''uL(w')st \Rightarrow^* V_1 \ldots V_n\, xuL(w')st$
$\Rightarrow V_1 \ldots V_n\, y_1 Y uL(w')st \Rightarrow V_1 \ldots V_n\, y_1 z L(w')st$.

In this case, the new item is also essential.

Addition (ii)

The new item could be $(x \to y_1 \wedge Y y_2 \Delta p', w', T(w'))$ where
this item was in the conditional items set. In these
circumstances, $y_1 = e$, $T(w') = 1^{|w'|}$ and $L(w') = e$. Thus the
derivation,

$S_A \Rightarrow^* V_1 \ldots V_n\, xL(w)t \Rightarrow V_1 \ldots V_n\, y_1 Y y_2 L(w)t$

implies that the new item is essential.

The item, $(x \to y_1 \wedge \Delta p', w, T(w))$, can cause the addition
of a new item in the second way only (addition (ii)). The
reasoning in that case is still valid.

The algorithm to compute Q(q) involves a finite number
of steps, each adding new items. Since all initial items are
essential, the argument above will establish that all items
in Q(q) after any step are essential. Consequently, all items
in Q(q) are essential. □

We will not use Lemma 6.1 directly. However, the
reasoning used in its proof will be invoked in the next lemma.
Lemma 6.2


For any state q, all items in Q(q) are essential.


Proof


Consider the start state first. For the start state, $q_1$,
the initial item is $(S_A \to \wedge S \perp^m \Delta 0, e, e)$. Since $S_A$ does not
appear on the right side of any productions, by construction, $q_1$
cannot be the destination state of any read transition of M(G).
Thus the derivation $S_A \Rightarrow S \perp^m$ implies that the initial item
is essential. From Lemma 6.1, all items in $Q(q_1)$ are essential.

Let $q_n$ be any state other than $q_1$, and let
$(x \to y_1 \wedge y_2 \Delta p, w, T(w))$ be any item in $Q(q_n)$. Furthermore, let
$X_1 \ldots X_j y_1$ be any string such that M(G) has read transitions
so that

$q_1 X_1 \ldots X_j y_1 \vdash^* (X_1,q_1) \ldots (X_{n-1},q_{n-1})q_n$

and $X_{j+1} \ldots X_{n-1} = y_1$. We will prove that

$S_A \Rightarrow^* X_1 \ldots X_j\, xL(w)t \Rightarrow X_1 \ldots X_j\, y_1 y_2 L(w)t$

for some string t.


Each state set, $Q(q_i)$ ($1 \le i \le n-1$), contains a subset
$E(X_i)$ defined by

$E(X_i) = \{(u \to v_1 \wedge X_i v_2 \Delta p, w, T(w)) \mid (u \to v_1 \wedge X_i v_2 \Delta p, w, T(w)) \in Q(q_i)\}$.

From our discussion of the start state, all items in $E(X_1)$ are
essential. Thus, for any $(u \to v_1 \wedge X_1 v_2 \Delta p, w, T(w)) \in E(X_1)$,
$S_A \Rightarrow^* v_1 X_1 v_2 L(w)t$ for some string t (in this case, $v_1 = e$).

Assume that, for some i < n,

$S_A \Rightarrow^* X_1 \ldots X_k\, uL(w)t \Rightarrow X_1 \ldots X_k\, v_1 X_i v_2 L(w)t = X_1 \ldots X_i\, v_2 L(w)t$

(where $X_1 \ldots X_k v_1 = X_1 \ldots X_{i-1}$)
for all $(u \to v_1 \wedge X_i v_2 \Delta p, w, T(w)) \in E(X_i)$. This is certainly
true for i = 1. Now, the initial state set for $q_{i+1}$ will be

$\{(u \to v_1 X_i \wedge v_2 \Delta p, w, T(w)) \mid (u \to v_1 \wedge X_i v_2 \Delta p, w, T(w)) \in E(X_i)\}$.

Any item in $Q(q_{i+1})$ is either in the above set or found by
using the closure functions. If any element,
$(u \to v_1 \wedge X_{i+1} v_2 \Delta p, w, T(w))$, of $E(X_{i+1})$ is in the initial state
set, then we immediately have

$S_A \Rightarrow^* X_1 \ldots X_{i+1}\, v_2 L(w)t$.

If it is not an initial item, then, following the reasoning
of Lemma 6.1, we have

$S_A \Rightarrow^* X_1 \ldots X_k\, uL(w)t \Rightarrow X_1 \ldots X_k\, v_1 X_{i+1} v_2 L(w)t = X_1 \ldots X_{i+1}\, v_2 L(w)t$.

If i + 1 < n, then we repeat this argument for the set $E(X_{i+2})$.
Finally, when i + 1 = n, any item $(x \to y_1 \wedge y_2 \Delta p, w, T(w))$ in $Q(q_n)$
is either an initial item or is found by the closure
algorithm. Using the above argument, we have

$S_A \Rightarrow^* X_1 \ldots X_j\, xL(w)t \Rightarrow X_1 \ldots X_j\, y_1 y_2 L(w)t$

as required.

Since this is true for any state, any item, and any
state string $(X_1,q_1) \ldots (X_{n-1},q_{n-1})q_n$, every item must be
essential. □
It should be clear that M(G) is the 2SM constructed
from the set $CS(G_A)$. Hence, the following results can be proved.

Lemma 6.3


If the parser generator succeeds in constructing M(G),
then G is RP.

Proof


Let q be any state such that $(u \to y \wedge \Delta p, w, T(w)) \in Q(q)$
for some w. From Lemma 6.2, $(V_1,q_1) \ldots (V_n,q_n)P(y)q$ is a state
string ending in q if and only if

$S_A \Rightarrow^* V_1 \ldots V_n\, uL(w)t \Rightarrow V_1 \ldots V_n\, yL(w)t$

for some string t. But $V_1 \ldots V_n\, y \Delta p \in CS(G_A)$. Thus, in the
terminology of Chapter 3,

$M(G) = M(G, CS(G_A))$

and $CS(G_A)$ is a regular set. Since

$CS(G_A) = CS(G) \cup \{S \perp^m \Delta 0\}$,

CS(G) is regular by various closure properties of regular sets
(Hopcroft and Ullman 1969). Consequently, G is RP. □

Corollary

M(G) accepts the language $L(G_A) = L(G)\{\perp^m\}$.

Proof


Since $M(G) = M(G, CS(G_A))$, this corollary follows from
Theorem 3.2. □

Let the final D2SM be M. No matter which look ahead
algorithm was used, M also accepts the language $L(G_A)$.

The next theorem establishes the exact relationship
to D2SM's with the error property.

Theorem 6.1


If M(G) has no inadequate states (that is, M = M(G))
or if M is constructed from M(G) using the DRP look ahead
algorithm, then M is a D2SM with the error property.

Proof

All items in the state sets of M(G) are essential. For
any right context string, w, the DRP algorithm only uses L(w).
Thus the definition of essentiality is simply a restatement
of the error property, for if $(V_1,q_1) \ldots (V_n,q_n)P(y_1)q$ is any
state string and $(u \to y_1 \wedge y_2 \Delta p, w, T(w)) \in Q(q)$, then

$S_A \Rightarrow^* V_1 \ldots V_n\, uL(w)t \Rightarrow V_1 \ldots V_n\, y_1 y_2 L(w)t$

for some string t. Thus, from Lemma 6.3 and Theorem 3.2,

$(V_1,q_1) \ldots (V_n,q_n)P(y_1)q\, y_2 L(w)t \vdash^*$
$(V_1,q_1) \ldots (V_n,q_n)P(y_1 y_2)q' \vdash$
$(V_1,q_1) \ldots (V_n,q_n)q''\, uL(w)t \vdash^* \mathrm{ACCEPT}$.

A similar result holds if $y_1 y_2 = e$. □

Finally,  we  can  show  that  M  is  a  halting  D2SM. 

Theorem 6.2

Let M be the D2SM constructed using the parser generator
and any of the look ahead algorithms. M must halt on all
inputs.

Proof


This  result  follows  directly  from  the  Corollary  to 
Theorem  5.1.  □ 

We  close  this  chapter  with  an  example  of  the  parser 
generator  in  action. 
An Example of the Use of the Parser Generator

Consider the following grammar in 1-augmented form.

G = ({S_A,S,A,B}, {a,b,⊥}, P, S_A)

P:  S_A → S⊥ A0
    S → AS A1
    S → B A2
    B → bB A3
    B → b A4
    Ab → bbA A5
    A → a A6.

It can be shown that L(G) is not a context free language.
The proof is very lengthy and we will only give a brief summary
here.

Theorem 6.3

L(G) is not a context free language.


Proof 


The proof uses the "uvwxy" theorem (Hopcroft and Ullman 1969).

We  may  observe  that,  for  all  strings  b  a^i  in  L(G),  i  is  a  multip 
of  2^.  Moreover,  if  b^a'^b  a  leL(G)  then  k  =  p2  -  i2  for  some 
integer  p.  Using  these  relations,  we  can  prove  by  contradiction 
that  L(G)  is  not  context  free.  □ 
If we apply the parser generator, with m = 2, directly
to G, we obtain the following state sets.


State  1 : 

Q(l) 


{(S^  ->  ASiAO,e,e),(S  ^  aASA1,i,C),(S  ^ 
(A  AaA6  ,  Si  ,  00)  ,  (A  a aA 6  ,  Bi  ,  00 )  ,  ( A 

(A  A  a  A  6  ,  b  B  ,  00  )  ,  ( A  a  a  A  6  ,  AS  ,  00  )  ,  (  A 

(A  ^  A  a  A  6  ,  AA  ,  00  )  ,  (  A  ->  AaA6,AB,00),(A  ^ 
(A  ^  A  a  A  6  ,  aB  ,  00  )  ,  (  A  a  a  A  6  ,  Aa  ,  00  )  ,  ( A  ^ 

(A  ^  AaA6  ,  aa  ,  00)  ,  (A  ^  AaA6  ,  ab  ,  00)  ,  (A 

(B  ->  AbA4 , 1 , 0)  ,  (Ab  ->■  AbbAA 5  ,  l  ,  0 )  ,  ( Ab  ^ 
(Ab  ^  AbbAA5 , B ,0) , (B  ^  AbBA3,i,0)}. 


aBA2 ,1,0)  , 
AaA6  ,bi  ,00)  , 
AaA6,aS,00)  , 
A  a A  6 , aA , 00 )  , 
A  a A  6 , Ab , 00 )  , 
AaA6 ,bb  ,00)  , 
Abb AA5 , b  ,  0 )  , 


State  2  : 

Q(2)  =  {(S^  ^  SAiA0,e,e)}. 

State  3 : 

Q(3)  =  {(S  ->  AaSAI  ,1 ,0)  ,  (S  ^  aASAI  ,1 ,0)  ,  (S  ^  aBA2,1,0), 

(A  AaA6  ,  Si ,  00 )  ,  .  .  .  ,  ( A  a  a  A  6  ,  bb  ,  00  )  ,  (  B  ^  AbA4,l,0), 
(Ab  AbbAA5 ,1 ,0)  ,  (Ab  ->  Abb  AA  5  ,  b  ,  0  )  ,  ( Ab  ^  AbbAA5,B,0), 
(B  ->■  AbBA3,i,0)}. 

State  4 : 

Q(4)  =  { (S  ^  BAA2 ,1 ,0)  }  . 

State  5  : 

Q(5)  =  {(A  ^  aAA6  ,  Si  ,00)  ,  .  .  .  ,  (A  ^  aA A6 , bb  ,  00 )  }  . 
State 6:


Q(6)  =  {  (B  bA  A  4  ,1  ,  0 )  ,  (Ab  ^  bA  b  AA  5  ,  B  ,  0 )  ,  ( Ab  ^  bAbAA5,i 

(Ab  ^  bA  b  Aa  5  ,  b  ,  0 )  ,  ( B  ^  bA  Ba  3  ,l  ,  0  )  ,  ( B  ->■  AbA4,l,0) 

(B  ^  AbBA  3  ,1 ,0) i  . 

State  7 : 

Q(7)  =  {  (S^  ^  SiAA0,e,e)}  . 

State  8  : 

Q(8)  =  { (S  ^  ASaAO ,1,0)}. 

State  9 : 

Q(9)  =  { (Ab  ^  bbAAA5 ,B ,0) , (Ab  ^  bb a AA 5 , i , 0 ) , ( Ab  ^  bbAAA 
(B  bAA4,l,0),(B  ^  bABA3 ,1 ,0)  ,  (A  ^  AaA6,B,0), 

(A  ^  AaA6,b,0),(A  ^  AaA6,l,0),(A  ^  AaA6,bB,00), 
(A  ->  A  aA  6  ,  bb  ,  00  )  ,  (Ab  Abb  AA5  ,  e  ,  e  )  ,  (Ab  ^  AbbAA5, 

(Ab  Abb  AA5  ,  b  ,  0  )  ,  ( B  AbA4,i,0),(B  AbBA3,i,0) 

State  10 : 

Q(10)  =  {  (Ab  ^  bbAAA5  ,B  ,0)  ,  (Ab  ->  bbAA  A5  ,  i  ,  0 )  ,  ( Ab  ^  bbAA 

State  11: 

Q(ll)  =  { (Ab  ^  bAbAAS ,e ,e) , (Ab  ^  bAbAA5,B,0), 

(Ab  ->  bAbAA5  ,b  ,0)  ,  (B  ^  bAA4,l,0),(B  ^  bABA3,l,0 
(B  ^  AbA4,l,0),(B  AbBA3,l,0)}. 

State  12: 

Q(12)  =  { (Ab  ^  bb AAA5 , e , e ) , (Ab  ^  bbAAA5,B,0), 
(Ab  ->  bbA  AA  5  ,b  ,0)  ,  (B  bAA4,l,0),(B  ->  bABA3,l,0), 
(A->  AaA6,e,e),(A-^  AaA6,B,0),(A-^  AaA6,b,0), 

(A  A  aA  6  ,bB  ,  00)  ,  (A  ^  A  aA  6  ,  bb  ,  00  )  ,  (  Ab  AbbAA5,e,e), 
(Ab  ^  AbbAAS ,B,0) , (Ab  ^  a bbAA 5 , b , 0 ) , ( B  ^  AbA4,i,0), 
(B  ^  AbBA  3  ,1  ,0)}  . 

State  13: 

Q(13)  =  {  (B  ^  bBAA3,i  ,0)}  . 


State  14: 


Q(14)  =  {(A  ^  aAA6,B,0),(A  ^  aAA6,b,0),(A  ^  aAA6,l,0), 
T  (A->  aAA6,bBi00),(A^  aAA6,bb,00)}. 


State 15:

Q(15)  =  { (Ab  ^  bbAAAS ,e ,e) , (Ab  ^  bbAAA5,B,0), 
(Ab  bbAA  A  5  ,  b  ,  0  )  }  . 


State  16  : 
Q(16) 


=  {(A  aAA6,e,e),(A  ->■  aAA6,B,0),(A  ->  aAA6,b,0), 

(A  aAA6  ,bB  ,00)  ,  (A  ->  aAA6  ,bb  ,00)  }  .  , 

Note that states 6, 9, 11, and 12 are inadequate. All
the inadequacies can be resolved using the DRP look ahead
algorithm. The parser can be illustrated by a state diagram.
Look ahead transitions are shown in curly brackets.
[State diagram of the parser.]

This example not only demonstrates the parser generator
in action but shows that it can construct a D2SM with the error
property for a DRP grammar that generates a language that is
not context free.
CHAPTER 7

PROPERTIES OF DP LANGUAGES

Closure  and  decidability  results  about  classes 
of  languages  and  grammars  are  often  of  practical  interest. 

For  example,  it  would  be  interesting  to  know  if  general 
parser  generator  algorithms  exist  for  any  of  the  types  of 
grammars  we  have  investigated.  Clues  to  the  scope  of 
language  classes,  compared  to  other  known  sets  of  languages, 
can  also  be  obtained. 

The  emphasis  in  this  chapter  will  be  on  DRP  grammars 
and  languages.  Parsers  based  on  these  grammars  meet  all 
the  requirements  we  set  for  parsers  in  Chapter  3.  One  of 
the  more  practical  and  interesting  results  of  this  chapter 
is  that  for  any  TOG,  G,  and  any  given  integer  k,  there  is 
no  algorithm  to  determine  whether  there  is  a  set  ACS(G,k) 
that  satisfies  the  DRP  definition.  Consequently,  we  must 
be satisfied with "approximate" parser generator algorithms
similar  to  the  one  we  described  in  the  last  chapter. 

The  properties  of  regular  languages  are  used  extensively 
in  the  following  discussion.  Some  of  the  useful  results  are 
reviewed  below. 
Closure Properties of Regular Languages

Among other interesting results, Salomaa (Salomaa 1973)
and Hopcroft and Ullman (Hopcroft and Ullman 1969)
have shown that if L_1 and L_2 are any two regular sets then

(i) L_1 ∪ L_2 is a regular set;

(ii) L_1 ∩ L_2 is a regular set;

(iii) L_1 / L_2 is a regular set;

(iv) L_1 \ L_2 is a regular set;

(v) the complement of L_1 is a regular set;

(vi) for any regular set R, there is an integer k and
an LR(k) grammar G such that R = L(G);

(vii) all finite sets are regular.
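
As an illustration of closure under union, intersection, and complement, the following is a minimal sketch of the standard product construction on deterministic finite automata. The tuple layout (states, alphabet, delta, start, finals) and the names product_dfa, complement_dfa, accepts, EVEN_A, and ENDS_B are assumptions made for this sketch only; they do not come from the cited texts.

```python
from itertools import product


def product_dfa(dfa1, dfa2, mode):
    """Build a DFA for the union or intersection of two DFAs.

    Each DFA is an assumed tuple (states, alphabet, delta, start, finals);
    delta maps (state, symbol) to a state and must be total.
    """
    states1, alphabet, delta1, start1, finals1 = dfa1
    states2, _, delta2, start2, finals2 = dfa2
    states = set(product(states1, states2))
    # The product machine runs both machines in lock step.
    delta = {((p, q), a): (delta1[p, a], delta2[q, a])
             for (p, q) in states for a in alphabet}
    if mode == "union":
        finals = {(p, q) for (p, q) in states if p in finals1 or q in finals2}
    elif mode == "intersection":
        finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    return states, alphabet, delta, (start1, start2), finals


def complement_dfa(dfa):
    """Swap final and non-final states of a total DFA."""
    states, alphabet, delta, start, finals = dfa
    return states, alphabet, delta, start, set(states) - set(finals)


def accepts(dfa, word):
    """Run the DFA on word and report whether it ends in a final state."""
    _, _, delta, state, finals = dfa
    for symbol in word:
        state = delta[state, symbol]
    return state in finals


# Two small example machines over {a, b}:
# EVEN_A accepts strings with an even number of a's,
# ENDS_B accepts strings whose last symbol is b.
EVEN_A = ({0, 1}, {"a", "b"},
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
ENDS_B = ({0, 1}, {"a", "b"},
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
```

The quotient properties (iii) and (iv) need a different construction and are not covered by this sketch.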

Regular  Parsable  Languages  -  Inclusion  Results 


In Chapter 5, we showed that all DRP languages are
RP and all DRRP languages are RRP. We will now establish
several other connections between the subsets of RP
languages.

Theorem 7.1

If a TOL, L, is RRP, then L is also RP.

Proof

Since L is RRP, L = L(G) for some RRP grammar
G = (N,T,P,S). If G_A = (N_A,T_A,P_A,S_A) is the 0-augmented
grammar constructed from G, then
RCS(G_A) = RCS(G) ∪ {SA0}.

Since RCS(G) is regular, RCS(G_A) is also regular. From
RCS(G_A) we can construct the 2SM, M(G,RCS(G_A)). The finite
control of this parser is based on RCS(G_A) (see Chapter 3
for details of the construction). From Theorem 3.3,
M(G,RCS(G_A)) accepts the language L(G).

In general, M(G,RCS(G_A)) is nondeterministic. However,
it is an error property 2SM without look ahead. This is
a direct result of the definition of RCS(G_A).

If M(G,RCS(G_A)) = (K,V_L,V_R,T,I), then construct a
grammar

G_M = (V_L ∪ K ∪ (V_R - T) ∪ {A}, T, P, A)

where

P = {x → y | y → x ∈ I and y → x is not an accept instruction}
    ∪ {A → q_1 S | q_1 S → ACCEPT ∈ I}.

Since M(G,RCS(G_A)) is an error property 2SM without
look ahead, from the corollary to Lemma 5.5, G_M is RP.

If we add the production q_1 → eAp to P to obtain a new
grammar G_M', we have

CS(G_M') = CS(G_M) ∪ {Ap}.

Regular sets are closed under union and so G_M' is RP.
G_M' generates L(G). Hence L(G) is RP as required. □
In the deterministic case, we can prove a similar result.
The reasoning will be analogous to that used in the last proof.

Theorem 7.2

If a TOL, L, is DRRP, then L is also DRP.

Proof

Since L is DRRP, L = L(G) for some DRRP grammar G. Assume
that the set ARCS(G,k) satisfies the DRRP definition. If
G = (N,T,P,S), let G_A = (N_A,T_A,P_A,S_A) be the k-augmented
grammar constructed from G. RCS(G_A) is a regular set
(Theorem 5.3).

Assume that S does not appear on the right side of any
of the productions in P. There is no loss of generality in this
assumption since the 0-augmented grammar derived from G generates
the language L(G), is also DRRP, and satisfies the above

condition. Further assume that if (x,t) ∈ ARCS(G,k) then either
x = S⊥^k and t = e, or x contains no ⊥ symbols and t ∈ {e,⊥^k}.
There is no loss of generality in this assumption either.

Our strategy will be to construct a 2SM based on RCS(G_A).
A grammar is constructed from this 2SM. With appropriate
modifications, this grammar generates the language, L, and is
also DRP. Our assumptions about the goal symbol of G and the
ARCS(G,k) will make the construction simpler.
For ease of understanding, the proof has been divided
into sections.

Construction of a 2SM from RCS(G_A)

The construction is the same as that used in Theorem
5.8 and similar to the construction of Theorem 5.5 (in that
theorem, the characteristic strings, rather than the restricted
characteristic strings, were used).

For any apply symbol Ap, define
T(Ap) = {t | (wAp,t) ∈ ARCS(G,k)}. T(Ap) is the set of all
look ahead strings that may follow the right side of
production p in a sentential form. If T_j is any subset of
T(Ap) then define

S(T_j) = {wAp | (wAp,t) ∈ ARCS(G,k) for all t ∈ T_j}.

Finally, let RCS(G_A,Ap) = {wAp | wAp ∈ RCS(G_A)}. As a result
of the definition of the DRRP property, there are subsets
T_1,...,T_n of T(Ap) such that

(i) RCS(G_A,Ap) = S(T_1) ∪ ... ∪ S(T_n);

(ii) S(T_j) is regular for all j;

(iii) S(T_j) ∩ S(T_k) = ∅ if and only if j ≠ k; and

(iv) T(Ap) = T_1 ∪ T_2 ∪ ... ∪ T_n.

A 2SM, M, is constructed based on RCS(G_A). The
finite control is designed so that, for every set S(T_j),
there is a state q with a Ap transition so that

X_1...X_n |-* (X_1,q_1)...(X_n,q_n)q

using read transitions only, if and only if X_1...X_n Ap ∈ S(T_j)
(note that X_n = e if u → eAp ∈ P_A). By construction, M is
an error property 2SM without look ahead.


Construction  of  a  Grammar  from  M 


We construct a grammar G_M, as before. If
M = (K,V_L,V_R,T,I), then

G_M = (V_L ∪ K ∪ (V_R - T) ∪ {A}, T, P', A)

where

P' = {x → y | y → x ∈ I and y → x is not an accept
      instruction}
     ∪ {A → q_1 S_A | q_1 S_A → ACCEPT is the accept
      instruction for M}.


G_M is DRP

Because G is RRP, the proof of Theorem 7.1 establishes
that G_M is RP.

The special assumptions about ARCS(G,k) result in a
very useful property. Any inadequate state in M is in one
of two forms.

(i) A state with one reduce transition and one or
more read transitions, or
(ii) A state with an instruction of the form
q → (e,q)q'
and one or more other read instructions, none
of which refers to the empty string.

Also  any  state  that  has  a  read  transition  for  i  has  no  other 
transitions.  Thus  the  look  ahead  set  for  the  reduce 
transition  in  (i)  or  the  transition  outlined  in  (ii)  consists 
of  1  only. 

M has only read, accept, and reduce instructions.
Lemma 5.5 is valid for G_M. Let q'y → (X,q')xqAl be any
element of P' created from a reduce instruction. Then

A =>* uq'yw => u(X,q')xqw

in G_M if and only if u(X,q')xq is a state string of M.
Furthermore, from our observations, we know that if q
is an inadequate state with a single reduce transition,
then for any characteristic string u(X,q')xqAl in CS(G_M),

A =>* u(X,q')xqw

if and only if w = ⊥^k.

Similarly, if (X,q)q' → qXAl is in P', then

A =>* u(X,q)q'w => uqXw

in G_M if and only if uq is a state string. Again, if q is an
inadequate state with a read instruction for the empty string,
q → (e,q)q', then for any characteristic string uqAl in
CS(G_M),

A =>* u(e,q)q'w => uqw

if and only if w = ⊥^k.

It should be evident that the special methods
used in constructing M guarantee that G_M is DRP. If k > 0,
then we define the sets A_1 and A_2 to be

A_1 = {(xAl,⊥^k) | xAl ∈ CS(G_M), the l-th production is
       derived from a reduce instruction or an
       instruction involving the empty string, this
       instruction is applicable to state q, and q
       is inadequate}.

A_2 = {(xAl,e) | xAl ∈ CS(G_M) and (xAl,⊥^k) ∉ A_1}.

If k = 0, then we have

A_1 = {(xAl,e) | xAl ∈ CS(G_M)}
and A_2 = ∅.

If G_MA is the k-augmented grammar derived from G_M
(the goal post in G_MA is some new terminal symbol, say ⊣),
then the set

ACS(G_M,k) = A_1 ∪ A_2 ∪ {(q_1 S_A Ap,⊣^k), (A⊣^k A0,e)}

is a set of k-augmented characteristic strings for G_M.
From our observations, this set must satisfy the DRP
condition. Thus G_M is DRP.


Modification of G_M so that it Generates L


Any  derivation  under  G^^  begins  as  follows: 
E>  (S  ,  q^)  (1  ,q2)  .  .  .  (1  ,qj^)  =>*  q^^Sl^. 


The  productions  used  in  this  derivation  are  not  used  after 


the  sentential  form  q^ Si  is  derived,  because  S  does  not  appear 
on  the  right  side  of  any  production  in  the  original  grammar  G. 


Let  G^  -  (N^ , T^ , P ’ , A) .  Then  define  a  new  grammar 

Nm'  =  u  {S, ' } 

M  M  A 

P"  =  P’  -  {A  ->  q^S^, 

'^l^A  ^  (S  ,q^)  (1  ,q2)  .  .  .  (1  q^^2  ’ 
(S  ,q^)q2  ^  } 


k+1 


u  {a  q^SAp’,q^  ^  eAp,S^'  ->  Al  AO}. 

k  _  ■[<,+ 1 

Clearly  A  q,wl  if  and  only  if  S'  =>*  wl 

Gm 

k  +  1' 

Thus  S  '=>^  Wl  if  and  only  if  w  is  accepted  by  M  and 

so  =  L(Gj^'  )  . 


G 


M 


is  the  k+l-augmented  grammar  constructed  from 


a  grammar 

°m"  °  (N„'-{S^'J.T,P"'.A) 

where 

pMi  =  p”-{s^'  ^  Ai^'^^AO}. 
Clearly,  L  =  L(G^"). 


Let  us  define  the  set 

t  h 

A^  =  {(wAp,t)  I  (wAp  ,  t )  eACS  (Gj^^k)  and  the  p 

production  of  is  not  a  production  in 

The  set 

ACS(G„",k+l)  =  ACS  (G,^,k)"A„ 

u  {  (q^SAp  '  ,l)  ,  (Ai^^^AO ,e) } 

{(Ap,X)  1  A  =>*  q^Xw} 

Gm 

is  a  set  of  k+l-augmented  characteristic  strings  for 
the  grammar  .  Since 

(i)  G^  is  DRP ,  and 
(ii)  by  construction,  no 


s  tring  of 


G^'  can  begin  with  the  symbol  X  if 
(Ap ,X) eACS (G^" ,k+l)  , 

G_M" is DRP. Thus L is also DRP. □

A TOL, L, is recursive if and only if there
exists an algorithm to determine, for any string w, whether
w is in L. The following inclusion results are easily
proved.

Theorem 7.3

(i) If a TOL, L, is DP, then L is recursive.

(ii) If a TOL, L, is DRP, then L is DP.

Proof

(i) If L is DP, there is a DP grammar G such that
L = L(G). For some integer, k, there is a halting
D2SM that accepts the language L(G){⊥^k}. Thus we may
test whether any string w⊥^k is accepted by this D2SM.
The test is an algorithm because the parse of w⊥^k must
be unique and must halt. Since w⊥^k is accepted if
and only if w is in L, L must be recursive.

(ii) From Chapter 5, we know that any D2SM with the error
property is a halting D2SM. Thus any DRP language
is also DP. □
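
The argument in part (i) can be sketched as a loop that drives a halting deterministic machine to completion. This is only an illustration under stated assumptions: the functions initial and step, the "ACCEPT"/"REJECT" results, the "#" stand-in for the goal post symbol, and the toy machine for the language a^n b^n are all hypothetical and do not model a D2SM in detail.

```python
def decide(w, k, initial, step, endmarker="#"):
    """Decide membership of w by running a halting deterministic machine.

    initial(symbols) builds the start configuration and step(config)
    returns the unique next configuration, or "ACCEPT" / "REJECT".
    Both are hypothetical stand-ins for a halting D2SM.
    """
    config = initial(list(w) + [endmarker] * k)
    # The loop terminates because the machine halts on every input,
    # and the answer is well defined because each step is unique.
    while config not in ("ACCEPT", "REJECT"):
        config = step(config)
    return config == "ACCEPT"


def toy_initial(symbols):
    """Start configuration of the toy machine: (counter, phase, input)."""
    return (0, "a", tuple(symbols))


def toy_step(config):
    """One move of a toy halting machine for a^n b^n followed by '#'."""
    count, phase, rest = config
    if not rest:
        return "REJECT"
    head, tail = rest[0], rest[1:]
    if head == "a" and phase == "a":
        return (count + 1, "a", tail)
    if head == "b" and count > 0:
        return (count - 1, "b", tail)
    if head == "#" and count == 0 and not tail:
        return "ACCEPT"
    return "REJECT"
```

For example, decide("aabb", 1, toy_initial, toy_step) runs the toy machine on the input aabb# and reports whether it is accepted.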
We can draw an inclusion diagram for the various
classes of languages.

[Figure 7.1: LANGUAGE INCLUSION RELATIONS (inclusion diagram of
the language classes, including the type-0 languages)]


It is an open question whether these inclusions are
proper in each case. The relationship between recursive and
RP languages is also unknown. We have not investigated these
questions since our main interest is in DRP grammars and
languages.
Undecidability of the RP Property

In Chapter 5, we saw that it is undecidable whether
a TOG is DRP or DRRP. Since all DRP grammars are RP, we
might take De Remer's approach to parser generation. We
could find the set of characteristic strings and then apply
some look ahead algorithm. Unfortunately, there is no
algorithm to decide whether an arbitrary TOG is RP. The
following construction will help to prove this.

Informally, from any CFG, G', we will construct a
CSG, G, so that CS(G) contains L(G'). We then show that
CS(G) is regular if and only if L(G') is regular. Since
it is undecidable whether L(G') is a regular set (Hopcroft and
Ullman 1969), the required result is easily proved.

Construction

Let G' be any CFG, where G' = (N',T',P',S').
For each terminal, a, in T', define a new nonterminal
symbol X_a. We construct a CSG, G = (N,T,P,S), where

N = N' ∪ {S} ∪ {X_a | a ∈ T'}, S ∉ (N' ∪ T'),

T = T' ∪ {⊢,⊣}, {⊢,⊣} ∩ (N' ∪ T') = ∅, and

P contains the productions

S → ⊢ S' ⊣

⊢ X_a X_b → ⊢ a X_b, for all a,b ∈ T'

⊢ X_a ⊣ → ⊢ a ⊣, for all a ∈ T'

a X_b X_c → a b X_c, for all a,b,c ∈ T'

a X_b ⊣ → a b ⊣, for all a,b ∈ T'.

For each production, Z → Y_1'...Y_n' in P', P also contains
the production, Z → Y_1...Y_n, where
Y_i = Y_i' if Y_i' ∈ N' and Y_i = X_a if Y_i' = a ∈ T'.

For convenience, define the mapping

h(Y) = X_a  if Y = a ∈ T'
h(Y) = Y    if Y ∉ T'.

The domain and range of h are extended to strings by
defining

h(e) = e

and h(Xy) = h(X)h(y).
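
The construction of the productions of G from those of G', together with the mapping h, can be sketched as follows. The encoding is an assumption made for illustration: symbols are Python strings, X_a is written as the pair ("X", a), the two end markers are written "|-" and "-|", and the names X, h, build_csg_productions, EXAMPLE_CFG, and EXAMPLE_P are hypothetical.

```python
from itertools import product

LEFT, RIGHT = "|-", "-|"   # stand-ins for the two new end marker terminals


def X(a):
    """The new nonterminal X_a attached to terminal a (assumed encoding)."""
    return ("X", a)


def h(symbols, terminals):
    """Replace every terminal a by X_a; leave all other symbols unchanged."""
    return tuple(X(y) if y in terminals else y for y in symbols)


def build_csg_productions(cfg_productions, terminals, start, new_start="S"):
    """Return the productions of G as (left side, right side) tuples.

    cfg_productions is a list of (nonterminal, right side tuple) pairs
    for the context free grammar G'.
    """
    P = [((new_start,), (LEFT, start, RIGHT))]
    for a, b in product(terminals, repeat=2):
        P.append(((LEFT, X(a), X(b)), (LEFT, a, X(b))))
        P.append(((a, X(b), RIGHT), (a, b, RIGHT)))
    for a in terminals:
        P.append(((LEFT, X(a), RIGHT), (LEFT, a, RIGHT)))
    for a, b, c in product(terminals, repeat=3):
        P.append(((a, X(b), X(c)), (a, b, X(c))))
    # Copies of the productions of G' with each terminal a replaced by X_a.
    for left, right in cfg_productions:
        P.append(((left,), h(right, terminals)))
    return P


# Example: G' with productions S' -> a S' b and S' -> a b.
EXAMPLE_CFG = [("S'", ("a", "S'", "b")), ("S'", ("a", "b"))]
EXAMPLE_P = build_csg_productions(EXAMPLE_CFG, ["a", "b"], "S'")
```

In this sketch no production shortens its left side, provided G' has no productions with an empty right side, which is the length condition expected of a context sensitive grammar.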

A derivation in G starts with S → ⊢ S' ⊣ followed
by application of productions created from those in P'.
Once a new production (involving some X_a on the left side)
has been used, only new productions may be used if the
derivation is to be canonical.


If the productions in P' have the apply symbols
A0, A1,..., An, then number the new productions in P, An+1,
...,Am, and give S → ⊢ S' ⊣ the apply symbol Am+1. Now

CS(G) = {⊢}h(CS(G')) ∪ {⊢S'⊣Am+1} ∪ L_1 ∪ L_2

with

L_1 = {⊢ w ⊣ Ap | w ∈ L(G'), aX_b⊣ → ab⊣Ap ∈ P, and w
       ends in ab, or w = a and ⊢X_a⊣ → ⊢a⊣Ap ∈ P}

and

L_2 = {⊢ wAp | w = w'bcX_a and w'bcay ∈ L(G') for
       some y and bX_cX_a → bcX_aAp ∈ P}.

Clearly, L_1 and L_2 are related to L(G'). The
following lemmas establish the connection.

Lemma 7.1

(i) L_1/{Ap | n+1 ≤ p ≤ m} = {⊢}L(G'){⊣}

(ii) L_1 is regular if and only if L(G') is regular.

Proof

(i) True by definition.

(ii) Regular sets are closed under right and left quotient
(Hopcroft and Ullman 1969, Salomaa 1973).

Note that

L(G') = (({⊢}\L_1)/{Ap | n+1 ≤ p ≤ m})/{⊣}.

If L_1 is regular, then L(G') is regular.

If L(G') is regular, it is recognized by a finite state
machine. Since the apply symbol at the end of each string
in L_1 depends on at most the preceding three symbols, the
finite state recognizer for L(G') can easily be modified to
recognize L_1. Thus L_1 is a regular set. □
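
The quotient identity used in part (ii) can be checked on a small finite example. The sketch below works on finite sets of symbol tuples rather than on finite state machines, so it only illustrates the set operations and not the closure proof itself; the names right_quotient, left_quotient, LG, APPLY, L1, and RECOVERED, and the apply symbol labels "A7" and "A8", are hypothetical.

```python
def right_quotient(L, R):
    """L / R = {x | xy is in L for some y in R}, on sets of symbol tuples."""
    return {w[:len(w) - len(y)] for w in L for y in R
            if len(y) <= len(w) and w[len(w) - len(y):] == y}


def left_quotient(R, L):
    """R \\ L = {y | xy is in L for some x in R}, on sets of symbol tuples."""
    return {w[len(x):] for w in L for x in R if w[:len(x)] == x}


LEFT, RIGHT = "|-", "-|"   # stand-ins for the two end marker terminals

# A finite sample standing in for L(G').
LG = {("a", "b"), ("a", "a", "b", "b")}

# Assumed apply symbols for the two end productions that apply here.
APPLY = {("a", "b"): "A7", ("a", "a", "b", "b"): "A8"}

# Strings of the form |- w -| Ap, as in the definition of L_1.
L1 = {(LEFT,) + w + (RIGHT, APPLY[w]) for w in LG}

# Strip the left marker, then the apply symbol, then the right marker.
RECOVERED = right_quotient(
    right_quotient(left_quotient({(LEFT,)}, L1), {("A7",), ("A8",)}),
    {(RIGHT,)},
)
```

Under these assumptions RECOVERED equals LG, mirroring the identity stated in the proof.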

Let us define

L_3 = {wX_a | wa ∈ PREFIX(x) for some x ∈ L(G') and
       |w| ≥ 2}.

Clearly, L_2/{Ap | n+1 ≤ p ≤ m} = {⊢}L_3.

Lemma 7.2

L_2 is regular if and only if L_3 is regular.

Proof

The proof is virtually the same as in part (ii)
of Lemma 7.1 above. □

Lemma 7.3

If L(G') is regular, then L_3 is regular.

Proof

If L(G') is regular, it is recognized by some FSM,
M. M may be modified so that each state, q, belongs to only
one of the following sets:

(i) q can only be reached after reading at least two
symbols.

(ii) q is a start state or a state that can only be
reached by reading a single symbol.

If q belongs to the first set of states, then a new transition,
X_a, is added if q has a transition on the symbol a. The
destination of any new transition is a new state without
any transitions. This new state is the only final state
of the modified FSM. Clearly, the new FSM accepts L_3 and,
hence, L_3 is a regular set. □
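
For a finite sample of L(G'), the set L_3 can be enumerated directly from its definition. This is a sketch of the definition only, not of the finite state machine modification used in the proof; the function names prefixes and build_L3 and the pair encoding ("X", a) for X_a are assumptions.

```python
def prefixes(word):
    """All prefixes of a tuple of symbols, including the empty one."""
    return [word[:i] for i in range(len(word) + 1)]


def build_L3(language):
    """L_3 = {w X_a | wa is a prefix of some x in language and |w| >= 2}."""
    result = set()
    for x in language:
        for p in prefixes(x):
            if len(p) >= 3:               # p = wa with |w| >= 2
                w, a = p[:-1], p[-1]
                result.add(w + (("X", a),))
    return result
```

For the single string aabb this yields the two strings a a X_b and a a b X_b.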

Lemma 7.4

CS(G) is regular if and only if L_1 ∪ L_2 is regular.

Proof

Since G' is a CFG, CS(G') is regular. Hence,
by various closure properties (Hopcroft and Ullman 1969),
{⊢}h(CS(G')) ∪ {⊢S'⊣Am+1} is regular. If L_1 ∪ L_2 is regular,
then CS(G) is also regular, since regular sets are closed
under union.

Assume that CS(G) is regular. Now L_1 ∪ L_2 is the
intersection of CS(G) with the complement of
{⊢}h(CS(G')) ∪ {⊢S'⊣Am+1},
and by various closure properties, L_1 ∪ L_2 is regular. □

Lemma 7.5

L_1 is regular if and only if L_1 ∪ L_2 is regular.

Proof

All strings in L_1 end with an apply symbol for a production
of the form aX_b⊣ → ab⊣ or ⊢X_a⊣ → ⊢a⊣. On the other hand,
strings in L_2 end with apply symbols for productions of the
form ⊢X_aX_b → ⊢aX_b or aX_bX_c → abX_c. Thus no string from L_1
ends with the same apply symbol that terminates a string
in L_2.

If L_1 ∪ L_2 is a regular set, it is recognized by a
finite state machine. The final states of the finite state
machine are accessed by apply symbols. If we alter this
machine and define states accessed by apply symbols that
terminate strings in L_1 to be the only final states, then
the new finite state machine recognizes L_1. Thus L_1 is a
regular set.

If L_1 is regular, we may apply previous lemmas to
show that L(G'), L_3, and L_2 are also regular. Hence L_1 ∪ L_2
is a regular set.

Consequently, L_1 is regular if and only if L_1 ∪ L_2
is regular. □

Finally we have

Lemma 7.6

CS(G) is regular if and only if L(G') is regular.

Proof

If L(G') is regular, then the above lemmas show
that L_1, L_1 ∪ L_2, and CS(G) are regular.

If CS(G) is regular, then the previous lemmas
show that L_1 ∪ L_2, L_1, and L(G') are all regular sets. □
Consequently, the result we have been seeking is
easily established.

Theorem 7.4

It is undecidable for an arbitrary CSG or TOG, G,
whether G is regular parsable.

Proof

If this question is decidable, we may determine whether
our constructed grammar is RP. Since that grammar is RP if
and only if the language of the CFG, G', is regular, we have an
algorithm to decide whether an arbitrary CFG generates a regular
set. However, this question is known to be undecidable
(Hopcroft and Ullman 1969). □

If the CFG, G', has any useless productions, there
exists an algorithm to remove them (De Remer 1969). Thus
we may assume that G' has no useless productions. Under
these circumstances we may observe that for our constructed
grammar, G, RCS(G) = CS(G). Hence an immediate corollary
of the last theorem is

Corollary

It is undecidable whether an arbitrary CSG or
TOG, G, is RRP.

Proof

For our constructed grammar, RCS(G) = CS(G). Thus
G is RRP if and only if G is RP. The corollary follows
immediately. □


Closure  Properties  of  DRP  Languages 

DRP languages are closed under complementation.
To establish this result, we will use another construction.

Theorem 7.5

Let L be any DRP language over a set T. Then the
complement of L is also a DRP language.

Proof

Since the proof is rather long, we will give a
summary first.

If G is a DRP grammar, we construct a D2SM that
accepts an input w$^k⊥^(k+1) if and only if w is a sentential
form of G. $ is a new nonterminal symbol. This D2SM is
modified so that it will accept inputs that it would
previously have rejected. A grammar is then constructed
from the modified 2SM. The grammar will generate the
complement of the language L(G) and it will be DRP.


Construction 


Let G = (N,T,P,S) be any DRP grammar and assume
that the set ACS(G,k) satisfies the DRP definition.
G_A = (N_A,T_A,P_A,S_A) is the usual k-augmented grammar derived
from G. Thus

N_A = N ∪ {S_A}

T_A = T ∪ {⊥}

P_A = P ∪ {S_A → S⊥^k A0}.

The first part of the construction will transform
G_A so that if S_A =>* w⊥^k then w$^k⊥^(k+1) will be generated
by the new grammar. $ is a new nonterminal. When we build a
2SM based on the new grammar we will be certain that it
will not accept any sentential forms of G_A.


First, we define a mapping, h, from
(N_A ∪ T_A) to (N_A ∪ T ∪ {$}):

h(Y) = Y, if Y ≠ ⊥
h(Y) = $, if Y = ⊥.

As usual, we extend the definition of h so that
h(e) = e

and h(Xy) = h(X)h(y).

Let  G^  '  =  A  '  ’  A  '  ’  ^  A  '  ’  ^  A  '  ^  new  grammar  with 
N^'  =  N  u  {$,S^',S^"},$,S^',S^'V(N^  u  T^) 

T  ’  =  T  u  {l } 

A 

P  *  =  {s'  S  S,"  ^  S$^As} 

A  A  A  A 

u  (xt  ->  ytAp^  I  X  ->•  yApeP  ,  (wyAp  ,  t  '  )  £ACS  (G  ,k)  , 
and  t  =h(t’)}. 

G  '  is  the  k+l-augmented  grammar  constructed 
from  G'  =  (N  u  { S /',$},  T  ,  P S  " )  where 
P'  =  ^  S^''x^‘^^AO}  . 

Since G is DRP, it is clear that G' is also DRP and that
the set

{(uAp,e) | uAp ∈ CS(G_A')}

is a set of k+1-augmented characteristic strings for
the grammar G'. In this case, the look ahead strings
are all empty. From Theorem 5.2, we know that CS(G_A') is
a regular set.

We can construct a D2SM, M, with the error property
directly from CS(G_A'). This 2SM is M(G',CS(G_A')) in the
terminology of Chapter 3. By construction, M will have no
inadequate states and will have only accept, read, and
reduce instructions. Moreover, M accepts an input w$^k⊥^(k+1)
if and only if w is a sentential form of G. Since $ is in
N_A', the language accepted by M is empty.

Before constructing a grammar from M, we must
modify it so that it accepts the language L'{⊥^(k+1)}, where
L' is the complement of L(G).
Only read states will be altered. The changes will require
some new states and symbols. New states will be written
in the form s(...) and new symbols in the form R(...). None
of the new symbols will be in the terminal set.

Let q be any read state in M such that

(i) there is no instruction q⊥ → (⊥,q)q'
applicable to q, and

(ii) if there is an instruction q$ → ($,q)q' then q
is accessed by a symbol other than $.

If there is no instruction q$ → ($,q)q', then
define

E = {Y | there is no instruction qY → (Y,q)q"
     applicable to q and Y ∈ (N ∪ T)} ∪ {e}.

Otherwise,

E = {Y | there is no instruction qY → (Y,q)q"
     applicable to q and Y ∈ (N ∪ T)}.

For each Y ∈ E add the instructions

qY → (Y,q)s(Y,q)

(Y,q)s(Y,q) → qR(Y,q).

In particular, if e ∈ E there are instructions

q → (e,q)s(e,q)

(e,q)s(e,q) → qR(e,q).
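
The computation of E and of the two instructions added for each Y in E can be sketched as follows. The representation is an assumption made for illustration: a state is a string, the empty string "" stands for e, new states and symbols are tagged tuples, an instruction is a (left side, right side) pair, and the function names error_symbols and error_instructions are hypothetical.

```python
def error_symbols(read_symbols, grammar_symbols, has_dollar_read):
    """Return E for one read state q.

    read_symbols: symbols Y for which q already has a read instruction.
    grammar_symbols: the set N union T.
    has_dollar_read: True if q has an instruction reading $.
    """
    E = {Y for Y in grammar_symbols if Y not in read_symbols}
    if not has_dollar_read:
        E.add("")                 # "" stands for the empty string e
    return E


def error_instructions(q, E):
    """Return the instructions added to state q for every Y in E."""
    instructions = []
    for Y in sorted(E):
        new_state = ("s", Y, q)   # the new state s(Y,q)
        new_symbol = ("R", Y, q)  # the new symbol R(Y,q)
        # qY -> (Y,q)s(Y,q)
        instructions.append(((q, Y), ((Y, q), new_state)))
        # (Y,q)s(Y,q) -> qR(Y,q)
        instructions.append((((Y, q), new_state), (q, new_symbol)))
    return instructions
```

The sketch does not check conditions (i) and (ii) on q; it assumes they have already been verified for the state passed in.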

Although q may have become inadequate with the
addition of the instruction q → (e,q)s(e,q), the modified
2SM, M', will be designed so that, if

uqv |- u(e,q)s(e,q)v |-* ACCEPT

in M', then v = ⊥^(k+1). Since q does not have a read instruction
of the form q⊥ → (⊥,q)q', M' could be made deterministic
by adding look ahead.

Having entered a state s(Y,q), M' will know that
M would have rejected the input. Furthermore, M' will
always enter such a state, for any input that would have
been rejected by M. The rest of the modifications allow
M' to accept these "erroneous" inputs. Thus, if w ∈ (N ∪ T)*
and w is not a sentential form of G, then M' will accept
w⊥^(k+1). If the terminal set of M' is chosen to be T, then
M' accepts the language L'{⊥^(k+1)}, where L' is the
complement of L(G).


Modifications to Allow Acceptance of "Erroneous" Inputs

Informally, once M' enters a state s(Y,q), we want
M' to perform a series of reductions so that the contents
of the L-STACK are reduced to a single symbol. Having
accomplished this, M' will then accept any string ending
in ⊥^(k+1) that it may find in the R-STACK. However, for a
state s(e,q), the R-STACK can only contain ⊥^(k+1) if the input
is to be accepted.
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If  q  is  the  start  state  then  add 

q^R(Y,q^)  ^  ( R ( Y , q ^ )  ,  q ^ ) s  ’  ( R ( Y , q ^ ) )  . 

Otherwise  q  is  accessed  by  the  symbol  Z.  Let  q"  be  any 
state  with  an  instruction 
q”Z  (Z,q'')q. 

We  add  the  instructions 

qR(Y,q)  ^  (R ( Y , q )  ,  q ) s ( R ( Y , q )  , Z ) 

(Z,q") (R(Y,q) ,q)s(R(Y,q) ,Z)  ^  q"R(Y,q). 

If  q"  is  the  start  state  q^,  we  add 
q^R(Y,q)  ^  (R ( Y , q )  ,  q ^ ) s  ’  (R ( Y , q ) ). 

If  not,  then  q”  is  accessed  by  some  symbol  Z'  from 

another  state  q’"  with  the  instruction 

q’''Z’  ->•  (Z',q"')q".  Thus  we  add  the  instructions 

q''R(Y,q)  ^  (  R  (  Y  ,  q  )  ,  q  "  )  s  (R  (  Y  ,  q  )  ,  Z  '  ) 

(Z’  ,q’")  (R(Y,q)  ,q")s(R(Y,q)  ,Z’)  ^  q'”  R(Y,q). 

Although  s(R(Y,q),Z')  may  have  been  created  before,  it 
is  always  an  adequate  reduce  state  by  construction.  This 
construction  is  repeated  for  all  states  that  access  q  ’  ”  . 
Eventually,  the  accessing  state  must  be  the  start  state. 

For state $s'(R(\varepsilon,q))$, we add

$(R(\varepsilon,q),q_1)\,s'(R(\varepsilon,q)) \to q_1R'(\varepsilon,q)$

$q_1R'(\varepsilon,q) \to (R'(\varepsilon,q),q_1)\,s''(R(\varepsilon,q))$

$s''(R(\varepsilon,q))\dashv \to (\dashv,s''(R(\varepsilon,q)))\,s(1)$

$s(1)\dashv \to (\dashv,s(1))\,s(2)$

$\vdots$

$s(k)\dashv \to (\dashv,s(k))\,s(k+1)$

$(R'(\varepsilon,q),q_1)(\dashv,s''(R(\varepsilon,q)))(\dashv,s(1))\cdots(\dashv,s(k))\,s(k+1) \to q_1R_A$

$q_1R_A \to \text{ACCEPT}$.

For state $s'(R(Y,q))$ ($Y \neq \varepsilon$), add

$(R(Y,q),q_1)\,s'(R(Y,q)) \to q_1R'(Y,q)$

$q_1R'(Y,q) \to (R'(Y,q),q_1)\,s''(R(Y,q))$

$s''(R(Y,q))\dashv \to (\dashv,s''(R(Y,q)))\,s(1)$

$\vdots$

$s(k)\dashv \to (\dashv,s(k))\,s(k+1)$

$(R'(Y,q),q_1)(\dashv,s''(R(Y,q)))(\dashv,s(1))\cdots(\dashv,s(k))\,s(k+1) \to q_1R_A$

$q_1R_A \to \text{ACCEPT}$

and

$s'(R(Y,q))Z \to (Z,s'(R(Y,q)))\,s'''(R(Y,q),Z)$

$(R(Y,q),q_1)(Z,s'(R(Y,q)))\,s'''(R(Y,q),Z) \to q_1R(Y,q)$

for all $Z \in N \cup T$.

Although $s'(R(Y,q))$ is an inadequate state, $M'$ is designed so that if

$(R(Y,q),q_1)\,s'(R(Y,q))v \vdash_{M'} q_1R'(Y,q)v \vdash^{*}_{M'} \text{ACCEPT}$

then $v = \dashv^{k+1}$. Since there is no read instruction $s'(R(Y,q))\dashv \to \ldots$, this inadequacy could also be resolved by look ahead.


Construction  of  a  Grammar  from  M’ 


Let the new 2SM be $M' = (K, V_L, V_R, T \cup \{\dashv\}, I)$. We construct a grammar $G_{M'}$ from $M'$. As usual,

$G_{M'} = (K \cup V_L \cup (V_R - T - \{\dashv\}) \cup \{A\},\; T \cup \{\dashv\},\; P_{M'},\; A)$

with

$P_{M'} = \{x \to y \mid y \to x \in I \text{ and } y \to x \text{ is not an accept instruction}\} \cup \{A \to y \mid y \to \text{ACCEPT} \in I\}$.

Productions in $P_{M'}$ are derived from read, reduce and accept instructions only.
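
The step from instructions to productions can be illustrated with a minimal sketch. It assumes, purely for illustration, that an instruction is stored as a pair of symbol tuples and that the marker `ACCEPT` and the function name `productions_from_instructions` are hypothetical; none of these names come from the thesis.

```python
from typing import List, Tuple

Symbols = Tuple[str, ...]
Instruction = Tuple[Symbols, Symbols]   # (left side, right side) of "y -> x"
Production = Tuple[Symbols, Symbols]    # (left side, right side) of "x -> y"

ACCEPT: Symbols = ("ACCEPT",)           # hypothetical marker for accept instructions


def productions_from_instructions(
    instructions: List[Instruction], start: str = "A"
) -> List[Production]:
    """Reverse every non-accept instruction y -> x into a production x -> y,
    and turn every accept instruction y -> ACCEPT into A -> y."""
    productions: List[Production] = []
    for left, right in instructions:
        if right == ACCEPT:
            productions.append(((start,), left))
        else:
            productions.append((right, left))
    return productions
```

For example, the instruction q -> (e,q)s(e,q) yields the production (e,q)s(e,q) -> q, which is the shape of the productions discussed in the DRP argument for this grammar.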

$M'$ is an error property 2SM without look ahead. Thus Lemmas 5.1, 5.2, 5.3, 5.4, and 5.5 are still valid and $G_{M'}$ is RP. $G_{M'}$ is also DRP. To prove this, consider the 1-augmented grammar constructed from $G_{M'}$, with start symbol $A'$. Its goal production is $A' \to A \dashv \Delta_0$, where $\dashv$ is the goal post used in $G_{M'}$.


By construction,

(i) for a production $(\varepsilon,q)\,s(\varepsilon,q) \to q\,\Delta_l$,

$A \Rightarrow^{*} u(\varepsilon,q)\,s(\varepsilon,q)v \Rightarrow uqv$

if and only if $uq$ is a state string of $M'$ and $v = \dashv^{k+1}$. Any other production with $q$ on the right side is of the form $(Z,q)q' \to qZ\Delta_{l'}$ and $Z \neq \varepsilon$. Thus, if

$A \Rightarrow^{*} u(Z,q)q'v \Rightarrow uqZv$,

then $uqZ\Delta_{l'} \in CS(G_{M'})$. Informally then, any characteristic string $uq\Delta_l$ can be distinguished from $uqZ\Delta_{l'}$ by the use of the look ahead string, $\dashv$, with $uq\Delta_l$;

(ii) for a production $q_1R'(Y,q) \to (R(Y,q),q_1)\,s'(R(Y,q))\,\Delta_l$,

$A \Rightarrow^{*} q_1R'(Y,q)v \Rightarrow (R(Y,q),q_1)\,s'(R(Y,q))v$

if and only if $v = \dashv^{k+1}$. Any other production with $s'(R(Y,q))$ on the right side is of the form

$(Z,s'(R(Y,q)))\,s'''(R(Y,q),Z) \to s'(R(Y,q))Z\Delta_{l'}$.

Once again, the look ahead string, $\dashv$, will distinguish the characteristic string $(R(Y,q),q_1)\,s'(R(Y,q))\Delta_l$ from $(R(Y,q),q_1)\,s'(R(Y,q))Z\Delta_{l'}$.

Using these observations, we can describe a set of 1-augmented characteristic strings for $G_{M'}$ that satisfies the DRP definition.


The productions of the 1-augmented grammar can be divided into three disjoint categories:

(i) $P_1 = \{A' \to A \dashv \Delta_0\}$;

(ii) $P_2 = \{(\varepsilon,q)\,s(\varepsilon,q) \to q \mid q \to (\varepsilon,q)\,s(\varepsilon,q) \in I\} \cup \{q_1R'(Y,q) \to (R(Y,q),q_1)\,s'(R(Y,q)) \mid (R(Y,q),q_1)\,s'(R(Y,q)) \to q_1R'(Y,q) \in I\}$;

(iii) $P_3$, the set of all remaining productions (those in neither $P_1$ nor $P_2$).

Only productions in $P_2$ need non-empty look ahead strings to make $G_{M'}$ DRP. In all cases the look ahead string is $\dashv$.

Thus the set

$ACS(G_{M'},1) = \{(u\Delta_p,t) \mid$ if the $p$th production is in $P_1$ or $P_3$ then $t = \varepsilon$; if the $p$th production is in $P_2$, then $t = \dashv\}$

is a set of 1-augmented characteristic strings for $G_{M'}$.

From our observations above and the fact that $G_{M'}$ is RP, it is clear that this set satisfies the DRP condition.


Construction of a DRP Grammar for $\overline{L(G)}$


The final step in the construction will yield a DRP grammar that generates the language $\overline{L(G)}$. This new grammar is derived from $G_{M'}$. Recall that, for some $w \in (N \cup T)^{*}$, $w\dashv^{k+1}$ is a sentential form of $G_{M'}$ if and only if $w$ is not a sentential form of $G$.


or 


A  derivation  in  G.,  ’  must  begin  with  either 

M 

A  i>'^  q,S 
^1  A 

_  +  k+ 1 

A  =>  q^R’(Y,q)i  for  some  new  nonterminal 


R'(Y,q).  The  productions  used  in  these  derivations 
cannot  be  used  in  any  subsequent  steps  of  the  derivation. 
Furthermore,  they  all  belong  to  the  set  of  productions, 

P^,  defined  above.  Let  us  define  P^  to  be  the  set  of 
productions  that  are  used  in  any  of  the  derivations  above. 

where 


M 


M  ’  M  ’  M 


Nm"  =  Nm'  " 

m  M  =  T*  ^ 

M  M 
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u  {a  ^  qS"} 

1  A 

u  {A  ^  q^R' (Y,q)  |  R’ (Y,q)  is  a  new  nonterminal 

created  by  the  modifications  to  M} . 

G^”  is  a  k+ 1-augmented  grammar  for  the  grammar 
G"  =  (N",T",P",A)  where 

N"  =  N^''-{A"} 

M 

T"  = 

N 

P"  =  P^/'-{A"  ->  Ai^'^^AO}. 

M 

Since  L(Gj^”)  =  L(G^')  = 

L(G")  =  LT^. 

Finally  we  must  show  that  G”  is  DRP .  The  set  of 

characteristic  strings  of  G^^"  is 

CS(Gj^")  =  {Ai^AO,Ap} 

u  (C S (GM ' ) - {wAp  I  wApeCSCG^’)  and 
t  ll 

the  p  production  was  in  P^}). 

In  Theorem  5.2,  we  showed  that,  for  any  DRP  grammar  G, 
the  set 

CS(G,Ap)  =  {wAp  I  wApeCS(G)} 
is  regular.  Thus  the  set 

{wAp  I  wApeCS(Gj^’)  and  the  p^^  production  is  in  P^} 

is  regular.  By  various  closure  properties  of  regular  sets, 

CS(G..'*)  is  regular  and  G^."  is  RP  . 

M  M 
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The  set 


ACS(G",k+l)  =  {  (Ai^'^^AO  ,e)  } 


U  { ( Ap ,X)  I  A  =>*  q  Xw} 

r  ’  ^ 


u  {(qAp,t)  I  (wAp , t ) e AC S (G^ ' )  and 
t  ll 

the  p  production  is  not  in  P^} 
is  a  set  of  k+ 1-augment ed  characteristic  strings  for  G" . 

It  satisfies  the  DRP  definition  and  so  G"  is  DRP .  Consequently 
L  (G.)/  =  L(G'’)  is  a  DRP  language. 


Now, if $L$ is a DRP language, $L = L(G)$ for some DRP grammar $G$. Using this construction we can construct another DRP grammar $G'$ such that $\overline{L} = \overline{L(G)} = L(G')$. Thus $\overline{L}$ is DRP as required. □

The  following  corollary  will  be  useful  in  proving 
some  undecidability  results. 

Corollary

Let $M$ be any D2SM with the error property. There exists a D2SM, $M'$, with the error property such that $L(M') = \overline{L(M)}$. There is an algorithm to construct $M'$ from $M$.


Proof 

This result follows from several previous lemmas. Given $M$, we can effectively construct a DRP grammar, $G$, such that $L(G) = L(M)$ (Theorem 5.7). Moreover, we can construct a set $ACS(G,0)$ that satisfies the DRP definition. Thus, we can create a D2SM, $M''$, with the error property such that $L(M'') = L(G) = L(M)$. $M''$ will have read, reduce and accept instructions only. In the proof of Theorem 7.5 we showed that we can construct from $M''$ another D2SM, $M'$, with the error property such that $L(M') = \overline{L(M'')} = \overline{L(M)}$.

Every construction we have described can be carried out as an algorithm. □

Before  investigating  the  decidability  of  various 
questions  about  the  DRP  grammars,  we  will  prove  another 
closure  property  that  will  be  useful  in  the  next  section. 

Lemma 7.7

Let $G_1 = (N_1,T_1,P_1,S_1)$ and $G_2 = (N_2,T_2,P_2,S_2)$ be any two DRP grammars such that $T_1 \cap T_2 = \emptyset$ and let @ be a new terminal symbol such that @ $\notin (T_1 \cup T_2)$. The language $L(G_1)\{@\}L(G_2)$ is DRP.

Proof 

Without loss of generality, we will assume that $N_1 \cap N_2 = \emptyset$.

Consider the grammar

$G_3 = (N_1 \cup N_2 \cup \{S_3\},\; T_1 \cup T_2 \cup \{@\},\; P_3,\; S_3)$

such that $S_3$, @ $\notin (N_1 \cup N_2 \cup T_1 \cup T_2)$ and

$P_3 = \{S_3 \to S_1 @ S_2\} \cup P_1 \cup P_2$.

Clearly, $L(G_3) = L(G_1)\{@\}L(G_2)$. We can show that $G_3$ is DRP. Informally, @ is now used as a goal post for the characteristic strings of $G_1$.
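
As an illustration only, the construction of the combined grammar can be sketched in a few lines. The representation of a grammar as a 4-tuple of Python sets and lists, and the names `concat_with_marker`, `marker` and `new_start`, are assumptions made for this sketch; the sketch builds the grammar and checks the disjointness hypotheses, but it does not check the DRP property.

```python
from typing import List, Set, Tuple

Production = Tuple[Tuple[str, ...], Tuple[str, ...]]
Grammar = Tuple[Set[str], Set[str], List[Production], str]  # (N, T, P, S)


def concat_with_marker(
    g1: Grammar, g2: Grammar, marker: str = "@", new_start: str = "S3"
) -> Grammar:
    """Build G3 with P3 = {S3 -> S1 @ S2} u P1 u P2."""
    n1, t1, p1, s1 = g1
    n2, t2, p2, s2 = g2
    # Hypotheses of the lemma: disjoint terminal and nonterminal sets.
    if t1 & t2 or n1 & n2:
        raise ValueError("vocabularies of G1 and G2 must be disjoint")
    used = n1 | n2 | t1 | t2
    if marker in used or new_start in used:
        raise ValueError("marker and new start symbol must be new symbols")
    n3 = n1 | n2 | {new_start}
    t3 = t1 | t2 | {marker}
    p3 = [((new_start,), (s1, marker, s2))] + list(p1) + list(p2)
    return n3, t3, p3, new_start
```

The first production of the result is the only one that mentions the marker, which is why @ can play the role of a goal post for the strings derived from the first grammar.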

Assume that the sets $ACS(G_1,k_1)$ and $ACS(G_2,k_2)$ satisfy the DRP definition. Consider any look ahead string, $t$, in $ACS(G_1,k_1)$ such that $t = t'\dashv^{m}$, $m > 1$ (if such a string exists). If $(w\Delta_p,t) \in ACS(G_1,k_1)$ and $(w'\Delta_{p'},t'') \in ACS(G_1,k_1)$ then $wt \notin PREFIX(w't'')$ and $w't'' \notin PREFIX(wt)$. Surely, $wt'\dashv \notin PREFIX(w't'')$ and $w't'' \notin PREFIX(wt'\dashv)$. Consequently, we may assume that if $t$ is any look ahead string, then $t \in (N_1 \cup T_1)^{*}$ or $t = t'\dashv$ and $t' \in (N_1 \cup T_1)^{*}$.

Let $k_3$ be the maximum of $k_1$ and $k_2$, and consider the $k_3$-augmented grammar derived from $G_3$, together with the $k_1$- and $k_2$-augmented grammars for $G_1$ and $G_2$ respectively.

For ease of notation, we define a mapping from $(N_1 \cup T_1 \cup \{\dashv\})$ to $(N_1 \cup T_1 \cup \{@\})$:

$h(X) = X$ if $X \neq \dashv$,
$h(X) = @$ if $X = \dashv$.

This mapping can be extended to strings in the usual way by defining

$h(\varepsilon) = \varepsilon$

and $h(Xy) = h(X)h(y)$.
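
The extension of the mapping from symbols to strings can be sketched as follows. Strings are represented here as Python lists of symbol names, and the ASCII spelling `"-|"` for the goal post is an assumption of this sketch, not notation from the thesis.

```python
from typing import List

GOAL_POST = "-|"   # assumed ASCII spelling of the goal post symbol


def h_symbol(x: str, marker: str = "@") -> str:
    """h(X) = @ if X is the goal post, and X otherwise."""
    return marker if x == GOAL_POST else x


def h(string: List[str], marker: str = "@") -> List[str]:
    """Extend h to strings: h(empty) = empty and h(Xy) = h(X)h(y)."""
    return [h_symbol(x, marker) for x in string]
```

Applying `h` to a look ahead string therefore replaces each occurrence of the goal post by @ and leaves every other symbol unchanged.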

We will also assume that productions in $G_1$ and $G_2$ are given different apply symbols but that the corresponding productions in $G_3$ have the same apply symbol.

Consider  the  set 

k  ^ 

ACS(G3,k3)  =  {(S3i‘'3^o,e)}  u  {(S^eS^Ap’,!  ■" )  } 

u  {(S^@wAp,t)  1  (wAp , t) eACS (G2 

-{(S2i^2AO,e)}} 

u  {(wAp,t)  I  (wAp , t ' ) 6 AC S ( G ^ , k 3 )  - 

{  (  AO  ,  e  )  }  and  t£h(t’)}. 

Clearly, since $N_1 \cap N_2 = T_1 \cap T_2 = \emptyset$ and @ has replaced $\dashv$ in the look ahead strings from $ACS(G_1,k_1)$, this set satisfies the DRP definition. Therefore, $G_3$ is DRP and so $L(G_3)$ is a DRP language as required. □


Undecidability  Results  for  DRP  Languages 


Since all LR(k) languages are DRP, any undecidable question for LR(k) languages is also undecidable for general DRP languages.

Theorem  7 . 6 

Let $G_1$ and $G_2$ be any two DRP grammars. The following questions are undecidable.

(i) Is $L(G_1) \cap L(G_2) = \emptyset$?

(ii) Is $L(G_1) \cap L(G_2)$ a CFL?

(iii) Is $L(G_1) \cup L(G_2)$ an LR(k) language?

(iv) Is $L(G_1) \subseteq L(G_2)$?

(v) Is $L(G_1) \cap L(G_2)$ an LR(k) language?

(vi) Is $L(G_1) \cup L(G_2) = T^{*}$, where $T$ is the common set of terminal symbols for $G_1$ and $G_2$?


Proof 

All these questions are undecidable for LR(k) grammars (Hopcroft and Ullman 1969). If they were decidable for DRP grammars then they would be decidable for LR(k) grammars too. □

A common method of proving undecidability results is to devise a construction that will transform any instance of one question into an instance of another question. If the first question is known to be undecidable, then the second will also be undecidable. This method of attack is used in this chapter.

Most questions concerning DRP grammars are undecidable. The following construction will create a DRP grammar from any instance of Post's Correspondence Problem (Hopcroft and Ullman 1969). The language generated by the grammar will be empty if and only if the instance of Post's Correspondence Problem has no solution.


Let the lists $X = (x_1,\ldots,x_n)$ and $Y = (y_1,\ldots,y_n)$ be an instance of Post's Correspondence Problem (PCP). Each $x_i$ and $y_i$ is a string over some alphabet $T$. By definition, this instance of PCP has a solution if there exists a sequence of integers $i_1,\ldots,i_m$ ($m \geq 1$) such that

$x_{i_1}x_{i_2}\cdots x_{i_m} = y_{i_1}y_{i_2}\cdots y_{i_m}$.

Our strategy is to construct a DRP grammar, $G = (N',T',P',S_A)$, from this instance of the problem. $G$ will be designed so that $L(G) = \emptyset$ if and only if PCP with lists $X$ and $Y$ has no solution. As a result there can be no algorithm to determine whether an arbitrary DRP grammar generates an empty language.
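
To make the definition of a PCP solution concrete, here is a small sketch of a bounded search. Because PCP is undecidable, no search of this kind can decide the general problem; the bound `max_len` and the function name `pcp_solution` are assumptions introduced only for this example.

```python
from itertools import product
from typing import List, Optional


def pcp_solution(xs: List[str], ys: List[str], max_len: int) -> Optional[List[int]]:
    """Search for indices i_1..i_m (1-based, 1 <= m <= max_len) with
    x_{i_1}...x_{i_m} == y_{i_1}...y_{i_m}. Returns None if none is found
    within the bound, which says nothing about longer sequences."""
    if len(xs) != len(ys):
        raise ValueError("the two lists must have the same length")
    for m in range(1, max_len + 1):
        for indices in product(range(len(xs)), repeat=m):
            top = "".join(xs[i] for i in indices)
            bottom = "".join(ys[i] for i in indices)
            if top == bottom:
                return [i + 1 for i in indices]
    return None
```

A `None` result only means that no solution of length at most `max_len` exists; this one-sidedness is exactly what the reduction to the emptiness question exploits.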


Before describing $G$'s productions, we define some notation. For the $i$th element in either list, $A_i$, $A_i'$, $a_i$ and $a_i'$ are in the vocabulary of $G$. $A_i$ and $A_i'$ are nonterminals corresponding to $x_i$ and $y_i$. $a_i$ and $a_i'$ are terminals. The elements of $T$ are denoted by $b_j$, $1 \leq j \leq m$. Corresponding to each $b_j$ is a nonterminal $B_j$. The terminal set of $G$ is

$T \cup \{a_i \mid 1 \leq i \leq n\} \cup \{a_i' \mid 1 \leq i \leq n\}$.

The nonterminals of $G$ include

$\{A_i \mid 1 \leq i \leq n\} \cup \{A_i' \mid 1 \leq i \leq n\} \cup \{B_j \mid 1 \leq j \leq m\}$.

Other nonterminals will be introduced as they are required.


The  productions  of  G  fall  into  several  groups 


The  first  group  is 


(A)  : 


A 

S  ^  S 


S  ^  S^C . 
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y -eY, 

,  r 

let  y  . 

b  .  .  . 

.b  . 

1 

i. 

J- 

m  . 

1 

=  {b  . 

j 

1  B  .  B  .  } 

J  h 

u  {Bj 

Bj  1 

1  -J 

u  {B  . 

1 , 

.  •  .  B  .  B  . 

i„  J 

1  B  • 

J 

^  B  . 
1 

m 


i-1 


m 


For  every  string,  there  are  productions 


(B):  S,  ^A.S.B.  ...B, 

1  ill,  1 

1  m 


S,  A.’B.  ...B, 

1  1  ii  1 

i  m . 

1 


P  also  includes  the  productions 


(C)  :  S^C  S^S^C 

S„  A.S„,  for  all  i,l<i<n. 

3  1  3 

S^Bj  ,  for  all  j,l<j<m. 

S,  C  ^  C 

4 

S.B.  ->■  B.,  for  all  j,l<j<ni. 

4  j  j  >  j  >  j 

For  every  y^  =  b.  . . .b .  in  Y,  there  are  productions 

m .  1 

1 

(D)  :  ->  A^S^y’,  for  all  y'eS(y^^). 

^  for  all  y'eSCy^’^). 


Final ly ,  P 

includes 

the 

following  groups  of 

productions 

(E):  A^b 

.  ^  b  ,  A  .  , 
J  J  1 

for 

all  i 

and  j  ,  l<i<n , 

i<  j  <ni . 

(F):  A. 4 

.  a  .  A  .  , 

a  J  1 

for 

all  i 

and  j  ,  1  < i <n , 

1< j  <n . 

(G) :  A. a 

1 

.  ’  ->■  a  .  A  . 
J  J  1 

’ ,  for  all 

i  and  j,l<i<n 

,  i<j<n. 

(H) :  for 

each  X • 

=  b  . 

•  •  •  • 

,  in  X,  P  contains 

n . 
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A  . 


A.  '  B  . 

->■ 

b  .  A. 

^  ^1 

^1  ^1 

A  .  B  , 

b  .  A. 

If  I2 

• 

• 

I2  I2 

A  .  B  , 

^  b  .  A. 

1  .  1  .  1 

3  J  +  1 

1-^1  1  • 

J  +  1  J 

b  .  a  .  ' 

.  - 1  1 

1  n 

1  1  ■ 

n  . 

Although there are many productions, the structure of the derivations is surprisingly simple. Any derivation begins with either

$S_A \Rightarrow S \Rightarrow S_1$

or

$S_A \Rightarrow S \Rightarrow S_2C$.

In the latter case, a terminal string will never be derived since the nonterminal $C$ does not appear on the left side of any production. The sentential form will derive other sentential forms


...A,  2,  ...z*  j  m— 1 , 


Jm  Jm 


where 


=B.  ...B.  ify. 

m .  k 

1 


”  b.  ...b# 

1  m , 


and  y .  e Y . 
J 


k  1  ^m.  ""k  ^m.  k 

1  1 

An  interesting  feature  of  G  is  illustrated  in  the  following 
lemma . 


Lemma 7.8

Let $u$ be in $\{A_i \mid 1 \leq i \leq n\}^{*}\{A_i' \mid 1 \leq i \leq n\}$ and let $v$ be in $\{B_j \mid 1 \leq j \leq m\}^{+}$. $S_A \Rightarrow^{*} uv$ if and only if $S_A$ does not derive $uvC$.
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Proof 


We will only give an informal proof of this lemma.

If $uv$ is of the form $A_{j_1}\cdots A_{j_m}'z_{j_m}\cdots z_{j_1}$ as above, then, clearly, the productions of the group (B) result in the derivation

$S_A \Rightarrow^{*} uv$.

If $uv$ is not of this form, then some production in group (D) must have been involved. In particular, $uv$ must be in the form


^1 

A . u^v^ V ’ 
1  2  2 

^1’ 

where  u^v^ 

=  e  o  r 

u  ^ Vo  =  A . 

A  '  7 

•  •  •  ^  • 

. . z .  as  above. 

2  2 

2  2 

J  J 

Jl 

v'  €  s  (yj^^)  , 

Ui^ { A^ 

1  1 < i <n } * , 

and  v^£{Bj 

1  1< j  <m}* .  This 

string  can 

only  be 

generated 

by  the  use 

of  productions  in 

(C),  then  one  in  (D),  followed  by  the  use  of  some  productions 
in  (B).  Thus,  in  any  sentential  form,  uv  is  followed  by  C  as 
required.  □ 


The  importance  of  this  lemma  is  that  any  uv,  as 
defined  above,  can  be  derived  followed  either  by  a  C  or 
nothing.  This  result  will  ensure  that  G  is  RP . 

Having derived $uv$ or $uvC$ the grammar can now employ the productions in groups (E), (F), (G), and (H). Informally, these productions allow the $A_i$'s to move to the right. When $A_i$ encounters a symbol of the form $B_j$, it begins to "check" that the following symbols make up the string $B_{i_1}\cdots B_{i_{n_i}}$ where $x_i = b_{i_1}\cdots b_{i_{n_i}}$ and $x_i \in X$. If this is the case, each $B_{i_j}$ is changed to its corresponding terminal, $b_{i_j}$, and $A_i$ becomes $a_i'$.
Thus  any  string  in  L(G)  is  of  the  form 
r  r  r  , 

Si  •  •■•Xa  3.#  • 

^m  -^m^m-l  ^m-1  ^1  -^1 


Furthermore,  such  a  string  is  in  L(G)  if  and  only  if 


Hence $L(G)$ is empty if and only if PCP with lists $X$ and $Y$ has no solution. Thus there can be no algorithm to determine if $L(G)$ is empty. We will show that $G$ is a DRP grammar.


The characteristic strings of $G$, ignoring apply symbols, can be divided into several disjoint sets. It will be convenient to use these sets when discussing look ahead. The sets

$M = \{A_i \mid 1 \leq i \leq n\}$

and

$N = \{x_j a_j' \mid x_j \text{ is in } X\}$

are used in the definitions below.


=  {S,S^,S2C} 

C2  =  {S^S^C} 

=  M*{A^S^  I  l<i<n} 

=  M*{A^S^w’  I  l<i<n  and  w’eS(y^^)} 

=  M*{A^’w’  I  l<i<n  and  w'eS(y^^)} 

C.  =  {S„S,B.  I  l<j<m} 

6  3  4  J  ' 




1<  j  <  m} 


=  { S^c}  u  { S„B  .  I 

7  3  3  j  ' 

„  =  M*{A. S, B .  . . . B . 

8  1  1  1 T  1 

1  m . 

1 

^  =  M*{A . ’ B ,  . . . B . 

9  1  1  1 

1  m . 

1 


l<i<n  and  y.  =  b.  ...b,  } 


1  m . 

1 


l<i<n  and  y.  =  b.  ...b.  } 


m . 

1 


=  M’^N*{b.  A. 


I  l<i<n  and  x.  =  b.  ...b,  } 

11  In. 

1 


u  M*N*{b .  . . .b .  A 

"l 

u  .  .  . 


l<i<n  and  X.  =  b.  ...b,  ...b.  } 

^  ^1  ^k  ^n^ 


11 


u  M*N*{x.  a. ’  I  l<i<n} 

1  1 

=  M*N*{b,  ...b.  A  I  l<i<n,  x.^  =  b.  ...b,  ,l<j<n., 

1,  i.k  1  1  1 

1  J  In. 

1 

and  l<k<n} 


12 


13 


=  M*N*{b,  ...b.  a .  A.  l<i<n,x.  =  b.  ...b.  ,  and  l<k<n} 

1  ^  1  1  k  '  1  1 .  1 

In.  In. 

1  1 

=  M*N*{b,  ...b,  a . A,  ’  1  l<i<n,x.^  =  b.  ...b.  ,  and  l^k^n} 

It  1  1  k  '  1  1.  1 

In.  In. 

1  1 


Since  all  of  these  sets  are  regular,  G  is  RP . 

By  construction,  no  string  in  can  be  a  prefix  of  a  string 

in  Cg.  Similarly,  no  string  in  can  be  a  prefix  of  a  string 

in  Cg .  • 

Thus  the  set 

{{(wAp,e)  I  weC^  u...  satisfies  the  DRP  definition. 


Since $G = (N',T',P',S_A)$ is the 0-augmented grammar constructed from the grammar $G' = (N' - \{S_A\},\, T',\, P' - \{S_A \to S\},\, S)$, $G'$ is a DRP grammar. PCP is undecidable (Hopcroft and Ullman 1969). Hence, we have the following undecidability results.
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Theorem  7 . 7 


Let $G$ be any DRP grammar and $R$ any regular set. The following questions are undecidable.

(i) Is $L(G) = \emptyset$?

(ii) Is $L(G) = R$?

(iii) Is $L(G)$ regular?

(iv) Is $L(G)$ a CFL?

(v) Is $L(G) \cap R = \emptyset$?

(vi) Is $L(G) \subseteq R$?

(vii) If $G'$ is another DRP grammar, is $L(G) = L(G')$?

(viii) Is $L(G)$ LR(k) for some $k$?


Proof 


(i) By the above construction, a DRP grammar can be created from any instance of PCP. The language generated by this grammar is empty if and only if the corresponding instance of PCP has no solution. The undecidability of PCP (Hopcroft and Ullman 1969) means this question must also be undecidable.

(ii) If this question were decidable, we could choose $R = \emptyset$ and be able to decide whether $L(G) = R = \emptyset$. From (i), this question is not decidable.

(iii) Let $G = (N,T,P,S)$ be any DRP grammar. Consider the language $L = L(G)\{@\}\{a^n b^n c^n \mid n \geq 1\}$, where $T \cap \{a,b,c\} = \emptyset$ and @ $\notin T \cup \{a,b,c\}$. $\{a^n b^n c^n \mid n \geq 1\}$ is a DRP language. We gave a DRP grammar for this language in Chapter 5. $L$ is a DRP language (Lemma 7.7) and there is a DRP grammar $G''$ that can be constructed so that $L = L(G'')$.

If $L(G) = \emptyset$, then $L = \emptyset$ and $L$ is a regular set. Assume that $L(G) \neq \emptyset$ and that $L$ is regular. We can define a homomorphism

$h(Y) = \varepsilon$ if $Y \in (T \cup \{@\})$,
$h(Y) = Y$ if $Y \in \{a,b,c\}$.

Regular sets are closed under homomorphism and so $h(L)$ is a regular set. But $h(L)$ is $\{a^n b^n c^n \mid n \geq 1\}$, which is not a regular set (Hopcroft and Ullman 1969). Thus $L$ is not regular.

Since $L(G'')$ is regular if and only if $L(G) = \emptyset$, and $G$ and $G''$ are DRP, it must be undecidable whether $L(G'')$ is regular.
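
The effect of the erasing homomorphism on a single word can be sketched as follows. The sample alphabet {0, 1} standing in for T and the helper names `erase` and `is_anbncn` are assumptions made for this illustration; words are plain Python strings with one character per symbol.

```python
def erase(word: str, keep: frozenset = frozenset("abc")) -> str:
    """Homomorphism h: symbols in `keep` map to themselves,
    every other symbol (those of T and the marker @) maps to the empty string."""
    return "".join(ch for ch in word if ch in keep)


def is_anbncn(word: str) -> bool:
    """Membership test for { a^n b^n c^n | n >= 1 }."""
    n = len(word) // 3
    return n >= 1 and word == "a" * n + "b" * n + "c" * n
```

For a word such as `"0110@aabbcc"`, `erase` discards the prefix contributed by the first language together with the marker, so the image of every word of the concatenated language lies in the second factor.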

(iv) Context free languages are also closed under homomorphism. Thus, in part (iii), $L(G'')$ is a CFL if and only if $L(G'')$ is regular. The result follows immediately from this observation.

(v) Let $G$ be any DRP grammar with terminal set $T$. $L(G) \cap T^{*} = \emptyset$ if and only if $L(G) = \emptyset$. Since $T^{*}$ is a regular set, this question is undecidable.

(vi) Let $G$ be any DRP grammar and $R$ any regular set. $L(G) \subseteq \overline{R}$ if and only if $L(G) \cap R = \emptyset$. $\overline{R}$ is a regular set (Hopcroft and Ullman 1969). This result follows immediately from (v).

(vii) All regular sets are LR(k) languages. Hence, they are also DRP languages. If we could determine if two DRP languages are equal, we could decide whether a DRP language equals a regular set. However, this latter question is undecidable.

(viii) The language $L$, in part (iii), is LR(k) if and only if $L(G) = \emptyset$. Thus we cannot determine whether a DRP language is LR(k). □

A very important result can be proved using part (iii) of this theorem. We have shown that for an arbitrary TOG, $G$, the following questions are undecidable:

(i) Is $G$ RP? Is $G$ RRP?

(ii) Is $G$ DRP? Is $G$ DRRP?

Of more practical interest is the question: is there a set $ACS(G,k)$, for a given value of $k$, that satisfies the DRP property? This question is also undecidable.

The  following  technical  lemma  will  be  proved  first. 

Lemma  7 . 9 


Let

$\Gamma = \{G \mid G$ is a DRP grammar, the left side of each production in $G$ begins with a nonterminal symbol and $ACS(G,0)$ satisfies the DRP definition$\}$.

For any $G \in \Gamma$, the questions

(i) $L(G) = \emptyset$?

(ii) $L(G)$ is regular?

are undecidable.

Proof

(i) The grammar used in the proof of Theorem 7.7, part (i), is in $\Gamma$. Thus this result follows from that theorem.

(ii) Consider the proof of part (iii) of Theorem 7.7. Let $G$ be any grammar in $\Gamma$ and let $L = L(G)\{@\}\{a^n b^n c^n \mid n \geq 1\}$ where $T \cap \{a,b,c\} = \emptyset$ and @ $\notin T \cup \{a,b,c\}$. The DRP grammar for $\{a^n b^n c^n \mid n \geq 1\}$ that was described in Chapter 5 is in $\Gamma$. Thus the grammar $G''$ of Theorem 7.7, part (iii), will also be in $\Gamma$. Consequently, this question is undecidable. □

Theorem  7 . 8 

Let  G  be  any  TOG  and  k  be  any  non-negative  integer. 

It  is  undecidable  whether  there  is  a  set  ACS(G,k)  that 
satisfies  the  DRP  condition. 

Proof 


The proof uses the same construction that was employed in Theorem 7.4. Let $G'$ be any DRP grammar such that the left side of each production in $G'$ begins with a nonterminal symbol and $ACS(G',0)$ satisfies the DRP definition. Following the construction used in Theorem 7.4, we can obtain a TOG, $G$, from $G'$. The productions of $G$ include productions of the form

$X_1'\cdots X_n' \to Y_1'\cdots Y_m'$

where $X_1\cdots X_n \to Y_1\cdots Y_m$ is a production of $G'$, $X_i' = X_i$ if $X_i$ is a nonterminal of $G'$, and $X_i' = X_a$ if $X_i = a$ is a terminal (see Theorem 7.4 for the details). Since $G'$ is DRP, it is also RP (Theorem 5.2) and $CS(G')$ is regular. As a result, Lemmas 7.1, 7.2, 7.3, 7.4, 7.5 and 7.6 are still valid. Thus $L(G')$ is regular if and only if $CS(G)$ is regular.


We  may  observe  the  following  facts  if  CS(G)  is  regular 
(i)  G  is  RP; 

(ii)  Since  ACS(G’,0)  satisfies  the  DRP  condition  it 

can  easily  be  shown  that  ACS(G,0)  also  satisfies  the 
DRP  condition. 

Thus  G  is  DRP  if  and  only  if  L(G')  is  regular.  If  there 
were  an  algorithm  to  decide  if  there  is  a  set  ACS(G,k),  that 
satisfies  the  DRP  condition,  for  any  given  G  and  k,  we  could 
apply  this  algorithm  to  the  constructed  grammar  G  with  k  =  0. 
As  a  result,  we  could  determine  whether  L(G')  is  regular. 
However, this question is undecidable. The theorem follows immediately. □

This  result  is  of  extreme  practical  importance. 
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Amongst  other  things,  it  implies  that  at  best  we  must  be 
content  with  parser  generator  algorithms  similar  to  the  one 
in  Chapter  6.  That  algorithm  will  be  successful  with  only 
some  DRP  grammars.  With  others,  it  will  terminate,  unable 
to  decide  whether  the  grammar  is  DRP  or  not. 

Because  DRP  languages  are  closed  under  complementation, 
we  can  derive  several  more  results.  First,  we  require  some 
preliminary  facts,  interesting  in  their  own  right. 

Lemma  7.10 

Let $M$ be any D2SM with the error property and let $R$ be any regular set. The following questions are undecidable:

(i) Is $L(M) = \emptyset$?

(ii) Is $L(M) \cap R = \emptyset$?

Proof 

(i) For Theorem 7.7, part (i), we constructed a DRP grammar, $G$. $L(G)$ was empty if and only if the instance of PCP from which $G$ was constructed had no solution. Since we know a set of 0-augmented characteristic strings for $G$, we can construct a D2SM, $M$, with the error property such that $L(M) = L(G)$ (Theorem 5.5). If we could decide whether $L(M) = \emptyset$, we could also decide whether $L(G) = \emptyset$. Hence, it is undecidable whether the language accepted by an arbitrary D2SM with the error property is empty.

(ii) Let $M = (K,V_L,V_R,T,I)$ be any D2SM with the error property. If we can decide whether $L(M) \cap R = \emptyset$, we may let $R = T^{*}$ and determine whether $L(M) \cap T^{*} = \emptyset$. However, $L(M) \cap T^{*} = \emptyset$ if and only if $L(M) = \emptyset$. Thus, this part of the proof follows immediately from part (i). □

With these results, we can prove the following theorem.

Theorem 7.9

For any two DRP grammars, $G_1 = (N_1,T_1,P_1,S_1)$ and $G_2 = (N_2,T_2,P_2,S_2)$, and any regular set $R$, the following questions are undecidable:

(i) Is $L(G_1) = T_1^{*}$?

(ii) Is $L(G_1) \supseteq R$?

(iii) Is $L(G_1) \cup L(G_2)$ a CFL?

Proof 

(i) Let $M = (K,V_L,V_R,T,I)$ be any D2SM with the error property. We can construct another D2SM, $M'$, with the error property such that $L(M') = \overline{L(M)}$ (Corollary to Theorem 7.5). Thus $L(M') = T^{*}$ if and only if $L(M) = \emptyset$. If we could decide whether $L(M') = T^{*}$, we could decide whether $L(M) = \emptyset$. From Lemma 7.10, we know this question is undecidable. Thus, for any D2SM, $M = (K,V_L,V_R,T,I)$, with the error property, it is undecidable whether $L(M) = T^{*}$. Theorem 5.7 allows us to construct a DRP grammar, $G$, from $M$ such that $L(G) = L(M)$. $T$ is the set of terminal symbols for $G$. If the question $L(G) = T^{*}$ were decidable for any DRP grammar $G = (N,T,P,S)$, then we could determine whether $L(M) = T^{*}$. Thus for any DRP grammar $G = (N,T,P,S)$, it is undecidable whether $L(G) = T^{*}$.

(ii) Let $M$ be any D2SM with the error property. For any regular set $R$, $L(M) \supseteq R$ if and only if $R \cap \overline{L(M)} = \emptyset$. Assume that the question $L(M) \supseteq R$ is decidable. Let $M''$ be any D2SM with the error property and let $R'$ be any regular set. $L(M'') \cap R' = \emptyset$ if and only if $\overline{L(M'')} \supseteq R'$. From the corollary to Theorem 7.5, we can effectively construct another D2SM, $M'$, with the error property such that $L(M') = \overline{L(M'')}$. By assumption we may decide whether $L(M') \supseteq R'$ and so we may determine whether $L(M'') \cap R' = \emptyset$. This contradicts Lemma 7.10. Thus, for any D2SM, $M$, with the error property and any regular set $R$, the question $L(M) \supseteq R$ is undecidable. Theorem 5.7 allows us to construct a DRP grammar, $G$, from $M$ such that $L(G) = L(M)$. If the question $L(G) \supseteq R$ were decidable, we could determine whether $L(M) \supseteq R$. Consequently, for any DRP grammar, $G$, and any regular set, $R$, it is undecidable whether $L(G) \supseteq R$.

(iii) Let $G_1 = (N_1,T_1,P_1,S_1)$ and $G_2 = (N_2,T_2,P_2,S_2)$ be any two DRP grammars such that $T_1 \cap T_2 = \emptyset$. Assume $L(G_1)$ is not a CFL ($L(G_1)$ could be $\{a^n b^n c^n \mid n \geq 1\}$ for example). From Lemma 7.7, if @ $\notin (T_1 \cup T_2)$, the languages $T_1^{*}\{@\}L(G_2)$ and $L(G_1)\{@\}T_2^{*}$ are DRP and DRP grammars can be constructed for them.

Let $L = T_1^{*}\{@\}L(G_2) \cup L(G_1)\{@\}T_2^{*}$. If $L(G_2) = T_2^{*}$, then $L = T_1^{*}\{@\}T_2^{*}$ and $L$ is regular. Thus $L$ is also a CFL. If $L(G_2) \neq T_2^{*}$ and $L$ is a CFL, then for some $x$ in $\overline{L(G_2)}$, $L \cap T_1^{*}\{@\}\{x\} = L(G_1)\{@\}\{x\}$. $L(G_1)\{@\}\{x\}$ is a CFL because

(a) $T_1^{*}\{@\}\{x\}$ is regular;

(b) $L$ is a CFL; and

(c) $L \cap T_1^{*}\{@\}\{x\}$ is a CFL because CFL's are closed under intersection with a regular set (Hopcroft and Ullman 1969).

Surely, $L(G_1)$ can be obtained from $L(G_1)\{@\}\{x\}$ by a homomorphism. Since CFL's are closed under homomorphism (Hopcroft and Ullman 1969), $L(G_1)$ is a CFL, a contradiction. Thus $L$ is a CFL if and only if $L(G_2) = T_2^{*}$. The undecidability of the question $L(G_2) = T_2^{*}$, established in part (i), means it is undecidable whether $L$ is a CFL. This establishes this part of the theorem. □

The following corollary will be useful in a later section.

Corollary

For any D2SM, M = (K, V_L, V_R, T, I), with the error
property, it is undecidable whether L(M) = T*.

Proof 


This  result  was  established  during  the  proof  of 
part  (i)  of  Theorem  7.9.  □ 

DRRP  Languages  -  Closure  and  Decidability  Results 


We  have  devoted  most  of  our  attention  to  DRP 
languages.  A  parser  generator  has  been  developed  for  some 
DRP  grammars.  More  important,  DRP  languages  can  be  parsed 
efficiently  according  to  the  guidelines  set  out  in  Chapter  3. 
DRRP  languages  are  also  a  subset  of  the  class  of  DRP 
languages.  In  this  section,  we  investigate  some  of  their 
closure  and  decidability  properties. 

Theorem  7.10 

The class of DRRP languages is not closed under
intersection.

Proof

This proof uses a standard method of proving
decidability results (Rosenkrantz 1969).

Let G_1 = (N,T,P,S) be any TOG such that L(G_1) is
not a recursive language. Such a language exists (Hopcroft
and Ullman 1969). Without loss of generality, we will
assume that there are no productions in G_1 of the form
u → ε, Xu → Xv, or uX → vX, and that all sentential forms
are delimited by an endmarker. An endmarker can be inserted
by using a 1-augmented grammar constructed from G_1. Clearly,
if L(G_1) is not recursive, then L(G_1){⊥} is not recursive.


Productions of prohibited forms are eliminated as
follows:

(i) u → ε -
Replace u → ε with uX → X for all X ∈ (N ∪ T).
The language generated is not changed.

(ii) Xu → Xv -
Assume that there are no productions of the form
u → ε. Replace Xu → Xv with Xu → A_X v and A_X → X.
A_X is a new nonterminal.

(iii) uX → vX -
Assume that productions of the form u → ε and Xu → Xv
have been eliminated. Replace uX → vX with uX → vA_X
and A_X → X. A_X is a new nonterminal.
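
The three replacement rules above can be sketched in Python. This is only an
illustration under stated assumptions, not the construction as the thesis
implements it: productions are assumed to be pairs of symbol tuples, the empty
tuple stands for ε, and the function name eliminate_prohibited and the naming
scheme "A_" + X for the new nonterminals are hypothetical (the scheme presumes
that no existing symbol already has such a name).

```python
def eliminate_prohibited(productions, symbols):
    """Apply rules (i)-(iii) once, in that order.

    productions: list of (lhs, rhs) pairs, each a tuple of symbol names;
                 an empty rhs tuple stands for the empty string.
    symbols:     the symbols X of N u T used by rule (i).
    """
    def fresh(x):
        # Hypothetical naming scheme for the new nonterminal A_X.
        return "A_" + x

    # (i) u -> e  becomes  uX -> X  for every X in N u T.
    step1 = []
    for lhs, rhs in productions:
        if rhs:
            step1.append((lhs, rhs))
        else:
            step1.extend((lhs + (x,), (x,)) for x in symbols)

    # (ii) Xu -> Xv  becomes  Xu -> A_X v  and  A_X -> X.
    step2 = []
    for lhs, rhs in step1:
        if lhs[0] == rhs[0]:
            x = lhs[0]
            step2.append((lhs, (fresh(x),) + rhs[1:]))
            step2.append(((fresh(x),), (x,)))
        else:
            step2.append((lhs, rhs))

    # (iii) uX -> vX  becomes  uX -> v A_X  and  A_X -> X.
    step3 = []
    for lhs, rhs in step2:
        if lhs[-1] == rhs[-1]:
            x = lhs[-1]
            step3.append((lhs, rhs[:-1] + (fresh(x),)))
            step3.append(((fresh(x),), (x,)))
        else:
            step3.append((lhs, rhs))

    # Drop duplicate A_X -> X rules while keeping the original order.
    return list(dict.fromkeys(step3))
```

After the call, no production in the result has an empty right side, and no
production begins or ends with the same symbol on both sides.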
We modify G_1 further. Let the productions in G_1
be numbered from 1 to m and let {B_ij | 1 ≤ i ≤ m, j = 1 or 2}
be a set of new nonterminals. The modified version of G_1 is
G_1' = (N', T, P', S) where

N' = N ∪ {B_ij | 1 ≤ i ≤ m, j = 1 or 2}

P' = {x → B_i1 y B_i2, B_i1 → Z_1, B_i2 → Z_2 |
         x → Z_1 y Z_2 is the i-th production in P}
     ∪ {x → B_i1, B_i1 → Z | x → Z is the i-th production in P}

Clearly, L(G_1) = L(G_1'). The new nonterminals will help
to identify the productions in P'. Assuming G_1' is in this
special form, we construct two CFG's, G_2 and G_3. If c is
a new terminal symbol, not in N' ∪ T, then strings in
L(G_2) and L(G_3) are of the form x_1 c x_2 c ... c x_{2n+1} where
x_i ∈ (N' ∪ T)*, x^R denotes the reversal of x, and

(i) if x_1 c ... c x_{2n+1} ∈ L(G_2), then x_{2n+1} = S and for
any integer p, 1 ≤ p ≤ n, x_{2p}^R ⇒* x_{2p-1} in G_1';

(ii) if x_1 c ... c x_{2n+1} ∈ L(G_3), then x_1 ∈ T* and for any
integer p, 1 ≤ p ≤ n, x_{2p+1} ⇒* x_{2p}^R in G_1'.

Consider the language L = L(G_2) ∩ L(G_3). If
x_1 c ... c x_{2n+1} ∈ L, then

(i) x_{2n+1} = S;

(ii) x_1 ∈ T*;

(iii) x_{2p}^R ⇒* x_{2p-1} in G_1', for all p (1 ≤ p ≤ n); and

(iv) x_{2p+1} ⇒* x_{2p}^R in G_1', for all p (1 ≤ p ≤ n).

Thus S = x_{2n+1} ⇒* x_{2n}^R ⇒* x_{2n-1} ⇒* ... ⇒* x_1 in G_1'
and x_1 ∈ L(G_1'). G_2 and G_3 are also structured so that if
x ∈ L(G_1') then there will be a string x c ... c x_{2n+1} in L. If
there is an algorithm to determine if, for some x ∈ T*,
there is a string x c ... c x_{2n+1} in L, then there is an algorithm
to see if x is in L(G_1'). But L(G_1') is not recursive, a
contradiction.


A more detailed description of G_2 and G_3 is given
below:

G_2 = ({S_2, A}, N' ∪ T ∪ {c}, P_2, S_2) where
S_2, A ∉ (N' ∪ T)

P_2 = {S_2 → AcS_2, S_2 → S, A → c}
      ∪ {A → XAX | X ∈ (N' ∪ T)}
      ∪ {A → yAx^R | x → y ∈ P'}

and G_3 = ({S_3, B, A}, N' ∪ T ∪ {c}, P_3, S_3) where
S_3, B, A ∉ (N' ∪ T)

P_3 = {S_3 → S_3cA, S_3 → B, A → c}
      ∪ {B → aB | a ∈ T}
      ∪ {B → a | a ∈ T}
      ∪ {A → XAX | X ∈ (N' ∪ T)}
      ∪ {A → y^R Ax | x → y ∈ P'}.
Both G_2 and G_3 are LR(0). The special form of
productions in G_1' guarantees that for any characteristic
string for a production A → yAx^R, say wyAx^R Ap, if y = y'Y
and x = Xx', then Y ≠ X. Thus this characteristic string
can be distinguished from a characteristic string for
A → XAX. The same is true in G_3 for productions of the
form A → y^R Ax. The use of special nonterminals in G_1'
means that characteristic strings for A → y_1 A x_1^R and
A → y_2 A x_2^R can be distinguished.

Since L(G_2) and L(G_3) are LR(0) they are also
DRRP. If L = L(G_2) ∩ L(G_3) is DRRP then there exists a
D2SM, M, with the restricted error property that accepts
the language L{⊥^k} for some k. Let k' be chosen so that,
if qt → q't or qXt → (X,q)q't is any instruction in M, then
|t| ≤ k'. For any string x ∈ T*, the following informal
algorithm will determine whether x ∈ L(G_1):

(i) Let s be any string in (T ∪ {c} ∪ {⊥})^k'.

(ii) If q_1 xcs ⊢* zqs in M, then there is a string
y ∈ (T ∪ {c})*{⊥^k} such that q_1 xcsy ⊢* ACCEPT
and xcsy ∈ L{⊥^k}. Thus x ∈ L(G_1).

(iii) M is a halting D2SM. If (ii) does not apply, then
an error is found in xcs. Choose another value for
s and try (ii) again. Since k' is finite, there are
a finite number of values for s. If (ii) does not
apply for any value of s, then x ∉ L(G_1).
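
The informal algorithm above amounts to a finite search over look-ahead
strings, which can be sketched in Python. The sketch assumes that every symbol
is a single character and that a callable runs_without_error is available to
stand in for simulating the halting D2SM M on a prefix; both the callable and
the name decide_membership are hypothetical and are not part of the thesis.

```python
from itertools import product


def decide_membership(x, alphabet, k_prime, runs_without_error, sep="c"):
    """Return True if some look-ahead string s lets M scan x c s without error.

    alphabet:            the symbols of T u {c} u {endmarker}, one character each.
    k_prime:             the bound k' on the look-ahead used by any instruction.
    runs_without_error:  hypothetical stand-in for simulating the halting
                         D2SM M on a prefix and reporting that no error occurs.
    """
    for letters in product(alphabet, repeat=k_prime):   # step (i): pick s
        s = "".join(letters)
        if runs_without_error(x + sep + s):             # step (ii)
            return True
    return False                                        # step (iii): every s failed
```

The loop terminates because there are only len(alphabet) ** k_prime candidate
strings s, which is the finiteness argument used in step (iii).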
Since L(G_1) is not recursive, there can be no
algorithm to determine whether x is in L(G_1). Thus,
L = L(G_2) ∩ L(G_3) cannot be DRRP and DRRP languages are not
closed under intersection. □

The  following  corollary  is  an  immediate  consequence 
of  this  theorem. 

Corollary 

The  class  of  DRRP  languages  is  not  closed  under  both 
union  and  complementation. 

Proof 

Let L_1 and L_2 be any two DRRP languages. If
L = L_1 ∩ L_2, then L is the complement of the union of
the complements of L_1 and L_2.
If DRRP languages are closed under union and complementation,
then L is always DRRP, a contradiction. □

We  conclude  this  chapter  with  some  decidability 
results  for  DRRP  languages  and  grammars.  Because  all  LR(k) 
grammars  are  also  DRRP,  all  questions  that  are  undecidable 
for  LR(k)  grammars  and  languages  are  also  undecidable  for 
DRRP  grammars  and  languages.  Thus  we  have 
Theorem 7.11

The following questions are undecidable for any
DRRP grammars G_1 and G_2:

(i) Is L(G_1) ∩ L(G_2) = ∅?

(ii) Is L(G_1) ∩ L(G_2) a CFL?

(iii) Is L(G_1) ∪ L(G_2) an LR(k) language for some
value of k?

(iv) Is L(G_1) ⊆ L(G_2)?

(v) Is L(G_1) ∩ L(G_2) an LR(k) language for some value
of k?

(vi) Is L(G_1) ∪ L(G_2) = T* where T is the common
terminal set of G_1 and G_2?

Proof 

All  LR(k)  grammars  are  also  DRRP.  All  these  questions 
are undecidable for LR(k) grammars G_1 and G_2 (Hopcroft and
Ullman  1969).  Hence,  these  questions  are  undecidable  for 
DRRP  grammars.  □ 

In Lemma 7.10, we showed that, for any D2SM, M, with
the error property, it is undecidable whether L(M) = ∅.
For D2SM's with the restricted error property, the emptiness
problem is decidable.
Theorem 7.12

Let M be any D2SM with the restricted error property.
It is decidable whether L(M) = ∅.

Proof

Consider the start state, q_1, of M. If there are no
instructions applicable to q_1 then L(M) = ∅. If there is
an instruction applicable to q_1, then by definition there
is a string of terminal symbols accepted by M. Thus L(M) = ∅
if and only if there are no instructions applicable to q_1.
Since there are only a finite number of instructions, we can
determine whether any of them are applicable to q_1. The
theorem follows immediately. □
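
The test used in this proof is a single scan of the instruction set, as the
following Python sketch shows. The Instruction record and its field names are
assumptions made for illustration; the sketch also reads "applicable to q_1"
as "has q_1 as the state on its left side", and it is only valid for machines
that already have the restricted error property.

```python
from typing import NamedTuple, Tuple


class Instruction(NamedTuple):
    """Hypothetical record for an instruction qt -> q't or qXt -> (X,q)q't."""
    state: str                   # the state q on the left side
    stack_symbol: str            # X, or "" for the form qt -> q't
    lookahead: Tuple[str, ...]   # the string t
    next_state: str              # the state q' on the right side


def language_is_empty(instructions, start_state):
    """L(M) is empty iff no instruction is applicable to the start state."""
    return not any(ins.state == start_state for ins in instructions)
```

For example, a machine whose only instruction leaves state q2 is reported as
accepting the empty language when the start state is q1.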

Given  any  DRRP  grammar,  we  have  not  developed  an 
algorithm  that  will  construct  a  D2SM  with  the  restricted 
error  property.  Thus,  we  are  unable  to  extend  the  last 
theorem  to  the  grammar  domain.  We  have  been  careful  in 
this  chapter  to  state  decidability  results  only  for  classes 
of  grammars  or  classes  of  parsers.  The  proof  of  similar 
results  for  a  class  of  languages  depends  on  how  these 
languages  are  represented  (by  grammars,  by  machines  that 
recognize  them,  etc) .  Thus,  we  could  say  the  emptiness 
problem  is  decidable  for  DRRP  languages,  if  we  assume  these 
languages  are  represented  by  D2SM's  that  accept  them.  On  the 
other hand, we cannot state a similar result if the
languages  are  represented  by  DRRP  grammars. 

The  existence  of  algorithms  to  construct  a  D2SM 
with  the  error  property  from  a  DRP  grammar  or  a  D2SM  with 
the  restricted  error  property  from  a  DRRP  grammar  is  an 
open  question  of  theoretical  interest  only.  In  practice, 
we  must  deal  with  all  TOG’s  and  the  problem  of  deciding 
whether  one  of  these  grammars  is  DRRP  or  DRP  is  undecidable 
(Theorem  5.9). 

Several  negative  results  can  be  obtained  directly 
from  our  results  for  DRP  grammars. 

Theorem  7.13 

Let G = (N,T,P,S) be any DRRP grammar and R be
any regular set. The following questions are undecidable.

(i) Is L(G) ⊆ R?

(ii) Is L(G) ∩ R = ∅?

(iii) Is L(G) ⊇ R?

Proof

Let M = (K, V_L, V_R, T, I) be any D2SM with the
error property. If we change the terminal set of M from
T to V_R, then the new D2SM, M' = (K, V_L, V_R, V_R, I), must have
the restricted error property. Moreover, L(M') = L(M).
(i) Assume that, for any DRRP grammar, G, and any
regular set R, the question, L(G) ⊆ R, is decidable.
The complement of T*, V_R* − T*, is a regular set. From
Theorem 5.8, we can construct a DRRP grammar, G', from M'
such that L(G') = L(M'). By assumption, we can determine
whether L(G') ⊆ V_R* − T*. However, L(G') ⊆ V_R* − T* if and
only if L(M') ⊆ V_R* − T* if and only if L(M) = ∅. This
contradicts Lemma 7.10. Thus L(G) ⊆ R is an undecidable
question.

(ii) L(G) ⊆ R if and only if L(G) is disjoint from the
complement of R. The complement of R is a regular
set. Thus this result follows immediately from (i).

(iii) Assume that for any DRRP grammar, G, and any regular
set R, we can decide whether L(G) ⊇ R. T* is a
regular set. From Theorem 5.8, we can construct a DRRP
grammar, G', from M' such that L(M') = L(G'). By
assumption, we can determine whether L(G') ⊇ T*. But
L(G') ⊇ T* if and only if L(M') ⊇ T* if and only if
L(M) = T*. Thus we can decide whether L(M) = T*.
However, M was any D2SM with the error property.
This contradicts the corollary to Theorem 7.3. Thus,
for any DRRP grammar G and any regular set R, it is
undecidable whether L(G) ⊇ R.
DP Languages and Grammars; Summary


Since all DRP grammars are also DP, most questions
that are undecidable for the former are also undecidable
for the latter.

The  various  decidability  results  are  summarized 
in  the  following  table.  The  set  of  terminal  symbols 
for  all  grammars  is  the  set  T.  R  is  any  regular  set. 

A  recursive  grammar  is  any  TOG  that  generates  a  recursive 
language . 
                                                   TYPE OF GRAMMAR

QUESTION                                     RECURSIVE   DP   DRP   DRRP   LR(k)

L(G_1) ⊆ L(G_2)?                                 U        U    U     U      U
L(G_1) = L(G_2)?                                 U        U    U     ?      ?
L(G) = ∅?                                        U        U    U     ?      S
L(G) = T*?                                       U        U    U     ?      S
L(G) = R?                                        U        U    U     ?      S
L(G) is regular?                                 U        U    U     ?      S
L(G) is LR(k), for some k?                       U        U    U     ?      T
L(G) is a CFL?                                   U        U    U     ?      T
Complement of L(G) = ∅?                          U        U    U     ?      S
Complement of L(G) is regular?                   U        U    U     ?      S
Complement of L(G) is a CFL?                     U        U    U     ?      T
L(G_1) ∩ L(G_2) = ∅?                             U        U    U     U      U
L(G_1) ∪ L(G_2) = T*?                            U        U    U     U      U
L(G_1) ∩ L(G_2) is LR(k), for some k?            U        U    U     U      U
L(G_1) ∩ L(G_2) is a CFL?                        U        U    U     U      U
L(G_1) ∪ L(G_2) is LR(k), for some k?            U        U    U     U      U
L(G_1) ∪ L(G_2) is a CFL?                        U        U    U     ?      T
L(G) ∩ R = ∅?                                    U        U    U     U      S
L(G) ⊆ R?                                        U        U    U     U      S
L(G) ⊇ R?                                        U        U    U     U      S
Complement of L(G) is of the same type?          T        ?    T     ?      T
L(G_1) ∩ L(G_2) is of the same type?             T        ?    ?     ?      U
G is unambiguous?                                -        T    T     T      T

S = decidable, U = undecidable, T = trivially true, ? = unknown
Conclusions


It  is  well  known  that  no  nontrivial  question  about 
TOL's  (a  question  is  nontrivial  if  there  is  a  TOL  for  which 
the  answer  is  yes  and  another  for  which  the  answer  is  no)  is 
decidable  (Salomaa  1973).  In  this  respect,  DRP  languages 
(represented  by  DRP  grammars)  resemble  the  class  of  TOL’s. 
Although  we  have  not  shown  the  relation  between  these  two 
classes  of  language  (other  than  the  fact  that  all  DRP  languages 
are  recursive),  it  is  fair  to  conjecture  that  DRP  languages 
form  a  large  subset  of  the  class  of  recursive  languages. 
CHAPTER 8


APPLICABILITY  OF  OTHER  RESEARCH  TO  TWO 

STACK  MACHINES 


There  has  been  a  great  deal  of  research  into 
topics  related  to  LR(k)  parsing.  These  include 
translation  schemes  (Lewis  and  Stearns  1968,  Aho  and 
Ullman  1972a),  parsing  table  reduction  (Aho  and  Ullman 
1972b,  Anderson  et  al  1973,  Joliat  1973),  and  error 
recovery  (James  1972,  Leinius  1970,  Wynn  1973). 

We  will  examine  each  topic  and  show  that  all 
of  this  research  can  be  extended  to  the  general  case 
of  D2SM's.  It  is  not  our  intention  to  elaborate  these 
extensions  in  detail,  but  rather  to  illustrate  the 
applicability  of  these  results. 

Translation  Schemes 

Syntax directed translation schemes (see (Aho
and Ullman 1972a) for a survey) are a formalism to
describe the translation of one string to another. In
general, this translation is based on the syntactic
structure (usually described by a CFG) of the first string.
Undoubtedly, there are theoretical results that can
be obtained for general translations based on TOG's.
However, we are more interested in the ability of
these schemes to describe the interaction between
parsing and the code synthesis activities of a
compiler.

Informally,  a  syntax  directed  translation 
scheme  describes  a  method  of  code  synthesis  geared  to 
the  recognition  of  productions  by  the  parser.  Code 
synthesis  activities  are  initiated  by  the  application 
of  a  rewriting  rule.  Since  a  2SM  represents  a 
generalization  of  the  LR(k)  parser  model,  it  is  clear 
that  this  approach  to  code  synthesis  is  possible. 

Table  Reduction 


Joliat  (Joliat  1973)  has  discussed  very  general 
size  reduction  techniques  for  LR(k)  parsers.  His 
methods  are  more  general  than  those  found  in  (Aho  and 
Ullman  1972b).  While  his  work  is  oriented  to  the 
LR(k)  case,  it  rests  fundamentally  on  the  observation 
that  transitions  to  a  given  state  will  all  involve 
reading  the  same  symbol.  Since  we  have  preserved  (by 
definition)  this  property,  Joliat’s  results  apply  with 
little  or  no  modification. 
This result is of great practical importance.
Parsers account for a not insignificant proportion of
a  compiler’s  space  requirements,  and  Joliat  has 
shown  that  very  large  reductions  in  the  number  of 
states  are  often  possible. 

The  Elimination  of  Look  Ahead 


Joliat's  methods  are  based  on  the  theory  of 
minimization  of  incompletely  specified  machines 
(Kohavi  1969).  In  general,  the  likelihood  of 
combining  two  states  in  a  finite  state  machine 
depends  on  the  number  of  states,  the  number  of 
different  transition  symbols,  and  the  number  of  different 
outputs. Joliat treats the read, read-and-look-ahead, look
ahead, or reduce characteristic of a transition as
an output. If we could eliminate read-and-look-ahead transitions
and look ahead transitions, it is probable that
further reductions in the size of the parser would
be possible.

Let G = (N,T,P,S) be any DRP grammar. For
some integer k, there is a set ACS(G,k) that satisfies
the DRP definition. In general, a D2SM based on
ACS(G,k) will have read-and-look-ahead transitions.
If G_A = (N_A, T_A, P_A, S_A) is the k-augmented
grammar derived from G, then construct a new grammar,
G' = (N_A, T_A, P', S_A). For every production u → vAp in
P_A, P' contains a set of productions of the form ux → vxAp
such that (wvAp, x) ∈ ACS(G,k). Clearly, L(G') = L(G_A) and
ACS(G',0) satisfies the DRP condition. Furthermore,
a parse of the string under G_A can be obtained from
a parse under G' because of the simple correspondence
between productions in P_A and P'. For example, a
reduction for a production ux → vxAp in P' corresponds to
a reduction for production p in P_A.

In the particular case of an LR(k) grammar, G'
is a context sensitive grammar. This fact has been
discussed by Revesz (Revesz 1971). G' will also be
DRRP. The D2SM parser allows us to use these ideas in
practice.
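
The construction of P' from P_A can be sketched in Python. This is a
simplified illustration rather than the thesis's algorithm: it assumes that
the look-ahead strings x paired with each production in ACS(G,k) have already
been collected into a table keyed by production number, and it records the
production number in place of the marker Ap. The function name
eliminate_lookahead and both input layouts are hypothetical.

```python
def eliminate_lookahead(productions, lookaheads):
    """Build the productions of P' from those of P_A.

    productions: dict mapping a production number p to (u, v), both tuples
                 of symbols, standing for the production u -> v.
    lookaheads:  dict mapping p to the set of look-ahead strings x (tuples
                 of symbols) assumed to be paired with p in ACS(G, k).
    Returns a list of (p, ux, vx) triples, one per production of P'.
    """
    new_productions = []
    for p, (u, v) in productions.items():
        # One new production ux -> vx for every admissible look-ahead x.
        for x in sorted(lookaheads.get(p, ())):
            new_productions.append((p, u + x, v + x))
    return new_productions
```

Because each triple keeps the original production number p, a reduction by
any of the new productions can be mapped back to production p of P_A, which is
the correspondence the text relies on.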

While  we  have  discussed  this  transformation  in 
terms  of  grammars,  it  is  clear  that  the  changes  can  be 
implemented  directly  on  the  parser  itself.  Furthermore, 
the  D2SM  need  not  have  the  error  property  for  these 
modifications  to  be  possible.  For  example,  any 
deterministic  parser  for  a  CFG,  constructed  using 
Knuth's  LR(0)  constructor  and  any  valid  look  ahead 
algorithm,  regardless  of  whether  it  produces  a  machine 
with  the  error  property,  could  be  altered  so  that  look 
ahead  is  removed. 
The modified D2SM will have read and reduce
transitions only. This fact will have a large impact
on Joliat's reduction algorithm. Some of the consequences
are:

(i)  Because  reduce  states  do  not  occupy  any 

space  in  the  parser  table,  the  addition  of 
new  reduce  states  by  the  transformation 
does  not  result  in  an  increase  in  table 
size.  Also,  since  there  are  no  look  ahead 
transitions,  it  is  no  longer  necessary  to 
maintain  information  in  parser  tables  to 
distinguish  read  transitions  from  look 
ahead  transitions.  This  factor  itself  can 
lead  to  a  reduction  in  table  size. 

(ii)  All  states  are  accessed  by  reading  a  unique 

symbol,  including  reduce  states.  Joliat 
(Joliat  1973)  has  pointed  out  that 
this  characteristic  greatly  enhances 
the  success  of  his  reduction  techniques. 

This  is  because  his  state  assignment 
procedure  is  very  much  simplified. 
(iii)  Because  of  the  elimination  of  transition
type  information  in  (i),  all  transitions 
represented  in  a  state  table  are 
read  transitions.  Therefore,  the 
type  of  each  transition  need  not 
be  considered  in  Joliat’s  state 
merging  algorithm,  and  in  general 
more  states  can  be  merged. 

Although  the  elimination  of  look  ahead 
creates  more  productions,  no  new  states  are 
introduced  into  the  parser  (see  (i)  above). 
Furthermore, no time penalty is incurred because
the time to look ahead has simply been replaced
by the time required to read. Thus, the
transformations we have suggested can result
in greater overall table size reductions with no
loss of speed.

List Structures

Efficient data structures have been designed
for the implementation of LR(k) parsers (Anderson et
al 1973). The similarity in structure of LR(k)
parsers and D2SM's suggests that these methods
are directly applicable to the general case.

Error  Recovery 

Informally,  error  recovery  algorithms  for 
LR(k)  parsers  involve  either  the  removal  or 
replacement  of  symbols  in  the  input  or  the  insertion 
of  new  symbols  into  the  input.  In  either  case,  the 
object  is  to  alter  the  input  so  that  parsing  can  be  resumed. 

LR(k)  parsers  detect  an  error  only  when  reading 
or  looking  at  the  beginning  of  the  unread  portion 
of  the  original  input  text.  In  fact,  this  is  true 
for  any  D2SM.  Moreover,  the  structure  of  D2SM’s 
is  a  generalization  of  LR(k)  parser  structure. 

Consequently, the fundamental strategies underlying
error recovery algorithms are applicable to general
D2SM's.


In  (Anderson  and  Rushby  1974),  the  authors  show 
that  look  ahead  in  LR(k)  parsers  plays  a  role  in  error 
detection  and  correction.  For  instance,  if  a  parser  does  not 
look  ahead  before  performing  a  reduction,  it  may  later 
detect  an  error  when  it  finally  reads  or  looks  at 
the next symbol in the input. If this error is
detected by looking ahead before the reduction,
more information may be available for the purposes
of error recovery. The last section showed that
look ahead can be eliminated from D2SM's. In
D2SM's with the error property, this transformation
will not negate any of the advantages of look ahead
discussed above.
CHAPTER 9


CONCLUSIONS 


We  have,  in  effect,  proposed  a  new  hierarchy 
of  languages,  defined  in  terms  related  to  programming 
languages  and  compilation.  Classes  of  grammars  are 
defined  that  embody  characteristics  that  allow  strings 
generated  by  them  to  be  parsed  with  varying  degrees 
of  efficiency.  While  subclasses  of  the  Chomsky 
hierarchy  are  defined  in  terms  of  the  form  of  re-writing 
rules (for example, for any production in a CSG, the
number  of  symbols  on  the  left  side  must  be  less  than 
or  equal  to  the  number  of  symbols  on  the  right  side),  the 
new  classes  are  defined  in  terms  of  constraints  on  the 
form  and  structure  of  canonical  derivations. 

Chomsky  developed  his  hierarchy  of  grammars  in  an 
attempt  to  devise  a  rewriting  system  capable  of  describing 
the  syntax  of  natural  languages  such  as  English.  Although 
these  grammars  have  been  widely  studied  from  that  point  of 
view,  they  have  also  been  used  in  programming  language  and 
automata  theory  research.  Our  new  approach  to  the  subdivision 
of  type-0  grammars  is  aimed  directly  at  the  programming 
language  area  and  has  yielded  some  interesting  and 
practical  results.  We  have  also  investigated  the 
properties  of  these  new  classifications  in  their  own 
right.  Further  work  into  relating  the  two
hierarchies  might  also  yield  new  results  in  the 
fields  of  automata  theory  and  linguistics.  This 
is  certainly  an  open  topic  for  research. 

Of  particular  interest  in  our  hierarchy  is 
the class of DRP grammars. For any of these grammars,
a parser can be constructed that

(i) halts on all inputs and accepts exactly
those strings that are generated by the
grammar;

(ii)  performs  a  single  left  to  right  scan 

of  the  input  without  using  backtracking; 

(iii)  detects  errors  as  soon  as  possible  in 
the  parse. 

We  have  concentrated  on  these  grammars  because 
they  meet  the  needs  of  the  compilation  process. 
Decidability results in Chapter 7 suggest that DRP
grammars form a broad subset of type-0 grammars.

A  parser  generator  algorithm  is  proposed  that 
will  construct  a  parser  for  some  DRP  grammars.  Among 
other  results,  it  is  shown  that 

(i)  the  parser  generator  is  a  generalization 
of  the  LR(k)  algorithm.  In  the  case  of 
a  CFG,  the  algorithm  behaves  exactly  as  the
LR(k)  algorithm  does; 

(ii)  these  parsers  detect  errors  as  early  as  possible 

in  the  single  left  to  right  scan  of  the  input;  and 
(iii)  the  time  efficiency  of  the  parser  is  directly 
related  to  the  number  of  non-context  free 
productions  involved  in  the  parse  of  the  input. 

In  the  case  of  a  CFG,  the  parser  requires  only 
linear  time  to  accept  an  input. 

An  investigation  of  the  closure  and  decidability 
properties  of  these  classes  of  grammars  has  been  made.  It  is 
shown  that  there  is  no  analogue  of  the  LR(k)  algorithm  for 
DRP  grammars.  In  particular,  there  is  no  parser  generator 
algorithm  that  will  construct  a  deterministic  parser, 
that  detects  errors  as  early  as  possible  in  a  left  to  right 
scan  of  the  input,  for  any  DRP  grammar. 

LR(k)  grammars  form  a  special  subclass  of  the 
set  of  RP  grammars.  It  is  shown  that  methods  to 
reduce  the  size  of  LR(k)  parsers  are  directly  applicable 
to  the  more  general  parsers  discussed  in  this  thesis. 

Moreover,  the  more  general  parser  techniques  suggest 
a  simple  transformation  that  should  increase  the  amount 
of  possible  size  reduction.  This  transformation
can  be  performed  automatically  in  a  parser  generator. 

Methods  for  error  recovery  and  syntax  directed 
code  synthesis  should  also  be  extendable  to  DRP 
grammars  and  their  parsers. 

In  simple  terms,  our  approach  to  the  problem 
of  designing  efficient  parsers  has  been  to  create  an 
abstract  parser  model  that  embodied  the  features  we 
required.  From  this  model,  it  has  been  possible  to 
define  corresponding  classes  of  grammars  and  to  investigate 
their  properties.  The  method  has  been  particularly 
successful  in  the  case  of  DRP  and  DRRP  grammars.  We 
have  chosen  the  LR(k)  concept  as  a  basis  for  our 
studies  because  we  feel  the  principles  involved  allow 
the  most  general  study  of  practical  parsing  without  the 
hindrance  of  unnecessary  constraints.  The  existence
of  a  large  number  of  results  concerning  LR(k)  grammars  has 
also  been  a  great  help  in  this  research. 

As  is  always  true  when  extending  or 
generalizing  a  concept,  care  must  be  taken  in  how 
this  is  done.  Factors,  which  in  a  particular  case 
are  of  no  importance,  must  be  considered  in  the 
general case. Thus, although all CFG's are RP, this
is  not  true  of  all  TOG’s  and  so  any  parser  generator 
for  a  subset  of  the  DRP  grammars  must  check  for  both 
determinism  and  regular  parsability. 

Other  attempts  have  been  made  to  extend  LR(k) 
parsing  methods  (Walters  1970).  We  feel  our 
generalization  is  more  appropriate  because  we  have 
paid  particular  attention  to  practical  problems.  The 
parsers  we  have  studied  will  halt  on  all  inputs  and 
the  parser  generators  we  have  discussed  will  also 
terminate.  These  and  similar  factors  must  be  taken
into  account  in  any  study  of  this  kind. 

Future  Research 


We  have  already  suggested  that  further 
investigations  into  the  properties  of  RP  grammars  and 
their  relation  to  other  known  grammatical  classes  could 
yield  interesting  results. 

There are three particular open questions that
are of interest. We have described a parser generator
in Chapter 6. Although there is no parser generator
algorithm for all DRP grammars, improvements to the
present  algorithm  could  extend  the  class  of  grammars 
for  which  parsers  can  be  constructed.  These 
improvements  might  include  new  look  ahead  algorithms. 

The  efficient  implementation  of  the  parser  generator 
is  also  an  interesting  problem.  It  is  likely  that  a  large 
number  of  states  will  be  created  by  the  algorithm. 

Although  table  reduction  techniques  can  be  applied  to 
yield  a  more  reasonable  parser  size,  methods  of 
constraining  the  intermediate  number  of  states  would 
be  very  useful.  Of  course,  the  efficiency  of  the  parser 
generator  is  not  as  important  as  the  efficiency  of  the 
parser  itself,  since  a  parser  is  constructed  only  once, 
but  used  many  times. 

The  second  open  area  of  research  is  suggested 
by  our  discussion  in  Chapter  4.  Although  we  have 
emphasized  the  importance  of  the  error  property,  there 
are  occasions  when  it  may  not  be  required.  However, 
we  still  require  that  a  parser  halt  on  all  inputs  so 
that  the  parser  generator  of  Chapter  3  is  not  sufficient. 

A  suitable  parser  generator  may  allow  the  construction 
of  halting  parsers  for  non-DRP  grammars.  We  have  indicated, 
in  very  broad  terms,  how  this  problem  might  be  tackled. 
Finally, perhaps the most interesting open question
is the application of some of these classes of grammars and
parsers to restructuring and transforming in a compiler, as
De Remer (De Remer 1973) has suggested. A possible conceptual
starting point is to view source programs in two ways. On the
one hand, the structure of the source program, as it appears
in a program listing, is designed with programming in mind.
Thus the source program may contain redundant information to
make programs more readable and the process of programming
less error prone. On the other hand, there may be a semantically
equivalent but syntactically different structure that is more
amenable to efficient code synthesis and generation. The
transformation from one structure to the other could be
accomplished using a grammar that is not context free. Scanners
perform some of these tasks already. The deleting of blanks
and comments from the input stream is an example of the removal
of redundant information. De Remer has suggested that work in
this area may become just as important as previous studies of
parsing based on context free grammars. Our results could
provide a basis for some of this new research.